Adaptive multi-task learning for speech to text translation

End-to-end speech-to-text translation aims to translate speech in one language directly into text in another, a challenging cross-modal task, particularly when data are limited. Multi-task learning is an effective strategy for sharing knowledge between speech translation and machine translation: it allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving speech translation performance. However, finding a set of weights that balances the tasks in multi-task learning is challenging and computationally expensive. We propose an adaptive multi-task learning method that dynamically adjusts the task weights based on the proportion of each task's loss during training, enabling adaptive balance in multi-task learning for speech-to-text translation. Moreover, inherent representation disparities between modalities prevent speech translation models from using textual data effectively. To bridge this gap, we propose applying optimal transport at the input of the end-to-end model to align speech and text sequences and learn shared representations between them. Experimental results show that our method effectively improves performance in the Tibetan-Chinese, English-German, and English-French directions.


Introduction
A speech-to-text translation (ST) system is commonly built as a pipeline of two components: an automatic speech recognition (ASR) model and a machine translation (MT) model [1,2]. The source language speech is transcribed by the ASR model, and the transcribed text is then translated by the MT model into target language text. However, such cascaded models suffer from error propagation and high latency. Recent works proposed end-to-end speech translation (E2E ST) models [3,4], which provide an effective solution by jointly optimizing a single model that converts source language speech into target language text. Despite these advantages, the cross-modal and cross-language nature of E2E ST introduces a challenge: data scarcity. Therefore, current research usually leverages knowledge acquired from MT tasks to assist in the training of ST models.
For low-resource languages, multi-task learning frameworks are commonly used to share knowledge between tasks, thereby improving the performance of the model on the target task. The performance of a multi-task model on each task tends to improve as the weight assigned to that task increases. However, when the weights exceed a certain threshold, the model's performance gradually decreases [5]. Different combinations of weights lead to variations in model performance, so how to allocate weights to each task in order to achieve optimal performance on the target task is an important concern. There are two typical methods for adjusting multi-task weights. One approach is to manually assign weights to each task and repeatedly experiment with weight combinations in search of the best one. The other approach is to use dynamic adjustment techniques [6]. Compared to manual adjustment, dynamically adjusting task weights during training allows for faster and more efficient identification of optimal weight combinations [7,8]. In this paper, we use an adaptive cross-entropy loss function based on task loss proportion as our multi-task objective function. This method is not only feasible but also achieves a better allocation of weights.
Additionally, due to the modality gap between speech and text, ST cannot learn MT knowledge well, so ST performance often lags behind that of MT. Previous studies have shown that when the input speech representation is similar to its corresponding text representation, information is better transferred from the MT task to the ST task, leading to improved ST performance [9]. Therefore, we propose using optimal transport (OT) to obtain representations of speech and text that are close to each other in Wasserstein [10] space, reducing the gap between the speech representation and the corresponding transcription. By transferring the cross-language conversion ability learned in MT tasks to ST tasks, the ST model can learn a better correspondence between source language speech and target language text from a small amount of parallel corpus data.
Our contributions are as follows: (1) A weight-updating scheme based on loss proportion is adopted to dynamically adjust the weights of each task during model training, which improves the adaptive ability of the multi-task ST model. (2) Based on the multi-task training framework, we introduce a cross-modal optimal transport method for ST, which reduces the gap between the speech representation and the corresponding transcription. (3) Experimental results on public speech translation datasets show that the proposed method can significantly improve model performance.

End-to-end ST
To overcome error propagation and reduce the latency of cascaded ST systems, Bérard et al. [3] proposed using an end-to-end architecture to directly translate speech into text in another language without intermediate transcription, which has become the dominant paradigm in recent years. However, the development of ST has been hampered by the scarcity of ST data and by its cross-modal and cross-language characteristics. To address this problem, researchers usually use pre-training [11][12][13], multi-task learning [14][15][16], and knowledge distillation [17,18] to introduce additional data and other tasks to improve performance.

Multi-task learning
Multi-task learning aims to enhance the target task by using related auxiliary tasks. Although multi-task learning is effective, manually adjusting the weights of each task is tedious. Therefore, dynamic adjustment of weights is usually used, which can be broadly categorized into two types: gradient-based methods and loss-based methods. Among the gradient-based methods, Chen et al. [7] studied the gradients from different tasks and applied task-dependent gradient normalization to encourage different tasks to learn at a similar speed. Among the loss-based methods, Kendall et al. [5] weighted multiple loss functions by taking into account the uncertainty of each task. Liu et al. [8] proposed the dynamic weight average (DWA) method, which averages task weights over time by considering the rate of change of each task's loss. However, these methods usually add extra complexity to the training phase. In this paper, we employ a weight-updating scheme based on loss proportions to automatically adjust the multi-task weights.

Optimal transport
OT is a classical mathematical problem. It is commonly used to describe the transfer cost between two distributions. Villani [19] provided a systematic and comprehensive exposition of OT theory. In recent years, this theory has been widely used in research to find consistency between languages or modalities. Chen et al. [20] used OT in image-text pre-training to achieve fine-grained alignment between words and image regions. Gu et al. [21] used OT to bridge the gap between semantically equivalent representations of different languages in MT, and Zhou et al. [22] used OT to mix the representations of the two modalities, overcoming the modality gap between speech and text and improving ST performance. In contrast to this approach, in this paper we use OT to obtain representations of speech and text that are close to each other in the Wasserstein space, reducing the gap between the speech representation and the corresponding transcription.

Bridging the modality gap
It is still difficult to make full use of MT data with the above techniques because of the modal differences between speech and text. Several works have attempted to bridge this gap. Liu et al. [23] reduced the length of speech representations to match text representations and narrowed the representation gap by minimizing their L2 distance; Xu et al. [13] mapped speech representations to text representations using connectionist temporal classification and mapping layers; Fang et al. [24] blended sequences of speech and text representations in order to bridge the modality gap; Han et al. [25] projected speech and text features into a shared semantic space; Zhou et al. [22] mixed speech and text sequences across modalities through optimal transport; and Ye et al. [9] brought sentence-level representations closer together through contrastive learning. Different from previous studies, in this paper we reduce the modality gap at the level of the embedding representations of speech and text, design effective methods to learn similar representations of speech and text, and establish connections between the two modalities, so that the ST model can better use information from different modalities and ultimately improve its performance.

Methods
In this section, we first describe the method of reducing the modality gap between speech and text through optimal transport (OTST), and then introduce adaptive multi-task learning for OTST in detail. Based on the E2E Transformer, we use optimal transport to compute the Wasserstein distance between the speech feature sequence and the text feature sequence before the Transformer encoder, and add the OT loss to the model training loss so that the encoded speech and its corresponding text are close to each other in the Wasserstein space. During model training, the three weights assigned to the ST, MT, and ASR losses are automatically tuned according to the proportion of each loss. Figure 1 gives a schematic overview of the proposed method.

Problem formulation
Speech translation aims to translate source language speech into target language text. The ST corpus is usually composed of triplet data $D = (s, x, y)$, where $s = (s_1, \ldots, s_{|s|})$ represents the source language speech sequence, $x = (x_1, \ldots, x_{|x|})$ is the transcript in the source language, and $y = (y_1, \ldots, y_{|y|})$ is the corresponding translation in the target language; $|s|$, $|x|$, and $|y|$ represent their respective lengths.

Model architecture
We use the same multi-task network architecture as XSTNet [26], which combines the ST, ASR, and MT training tasks, aiming to achieve E2E ST. The model consists of four modules: a speech encoder, a text embedding layer, a Transformer [27] encoder, and a Transformer decoder. It supports audio and text inputs, and these two inputs share the Transformer modules of the model.
The speech encoder extracts contextualized acoustic embeddings from the raw waveform. It consists of Wav2vec 2.0 [28] and a subsampler. The input is a raw waveform signal sampled at 16 kHz. Wav2vec 2.0 first extracts a speech representation from the raw waveform, but its output sequence is usually much longer than the corresponding text sequence. To better match the lengths of the audio representation and the text sequence, we add 2 convolutional layers with a stride of 2 after Wav2vec 2.0, reducing the time dimension of the speech representation by a factor of 4.
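The length reduction described above can be checked with a small calculation. The following is a minimal sketch, not the authors' implementation: it uses the kernel sizes and strides of the standard Wav2vec 2.0 feature encoder, a subsampler kernel size of 5 (the value reported in the model configuration), and assumes a padding of `kernel // 2` in the subsampler, which the paper does not specify. The function names are hypothetical.

```python
# Kernel sizes and strides of the seven convolutional layers in the standard
# Wav2vec 2.0 feature encoder (overall stride: 320 samples, i.e. 20 ms at 16 kHz).
W2V_KERNELS = (10, 3, 3, 3, 3, 2, 2)
W2V_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def conv_out_len(length, kernel, stride, padding=0):
    """Output length of a 1-D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1


def wav2vec_frames(num_samples):
    """Number of frames produced by the (unpadded) Wav2vec 2.0 feature encoder."""
    length = num_samples
    for kernel, stride in zip(W2V_KERNELS, W2V_STRIDES):
        length = conv_out_len(length, kernel, stride)
    return length


def subsampled_len(num_frames, kernel=5, stride=2, num_layers=2):
    """Length after the convolutional subsampler.

    Assumption: padding = kernel // 2 (not stated in the paper).
    """
    length = num_frames
    for _ in range(num_layers):
        length = conv_out_len(length, kernel, stride, padding=kernel // 2)
    return length


frames = wav2vec_frames(160_000)       # 10 s of 16 kHz audio
print(frames, subsampled_len(frames))  # 499 125
```

Under these assumptions, a 10-second utterance yields 499 Wav2vec 2.0 frames and 125 subsampled positions, i.e. roughly a fourfold reduction.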
The text embedding layer is set in parallel with the speech encoder to capture semantic information in the text and map the text tokens into embeddings. Using optimal transport, we calculate the Wasserstein distance between the parallel speech and text sequences obtained from the speech encoder and the text embedding layer.
Moreover, both the speech encoder and the text embedding layer are connected to the Transformer encoder. The encoder receives the output of the speech encoder or the text embedding layer and further learns semantic information, which is then processed by the Transformer decoder to obtain the final output of the model.
We first undertake pre-training of the model using external MT data and then optimize the entire model by minimizing cross entropy loss.

Cross-modal optimal transport
Optimal transport (OT) is a classical mathematical problem that provides powerful tools for comparing probability distributions [29]. It seeks the minimum cost of transferring one distribution to another, and this minimum cost serves as a distance between two discrete probability distributions. If we regard speech and text sequences as two independent distributions, OT can be used to measure the distance between them.

Optimal transport
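Consider two discrete probability distributions $\alpha$ and $\beta$, where $\alpha$ is represented by the masses $a_1, \ldots, a_n$ at positions $u_1, \ldots, u_n$ and $\beta$ by the masses $b_1, \ldots, b_m$ at positions $v_1, \ldots, v_m$. Given the transportation cost function $c(u_i, v_j)$, let $Z_{ij} \ge 0$ represent the mass transferred from $u_i$ to $v_j$; the total transportation cost can then be expressed as $\sum_{i=1}^{n} \sum_{j=1}^{m} Z_{ij}\, c(u_i, v_j)$. Let $Z^*$ be the transportation plan with the lowest transportation cost, which is calculated as follows:
$$Z^* = \arg\min_{Z} \langle C, Z \rangle \quad \text{s.t.} \quad Z \mathbf{1}_m = a, \quad Z^{\top} \mathbf{1}_n = b, \quad Z \ge 0,$$
where $Z$ and $C$ denote the $n \times m$ matrices whose elements are $Z_{ij}$ and $C_{ij} = c(u_i, v_j)$, $\langle C, Z \rangle = \sum_{i,j} C_{ij} Z_{ij}$, and $a = (a_1, \ldots, a_n)$ and $b = (b_1, \ldots, b_m)$ are the mass vectors.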
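As a small worked illustration of the transportation problem (the numbers are chosen for this example and do not come from the experiments), take $n = m = 2$, and write $a$ and $b$ for the mass vectors of $\alpha$ and $\beta$ (notation assumed here). Every feasible plan can be described by a single parameter $x$, so the minimum can be read off directly:

```latex
\[
a = \begin{pmatrix} \tfrac12 \\ \tfrac12 \end{pmatrix}, \quad
b = \begin{pmatrix} \tfrac14 \\ \tfrac34 \end{pmatrix}, \quad
C = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
Z(x) = \begin{pmatrix} x & \tfrac12 - x \\ \tfrac14 - x & \tfrac14 + x \end{pmatrix},
\quad 0 \le x \le \tfrac14 .
\]
\[
\langle C, Z(x) \rangle = \left(\tfrac12 - x\right) + \left(\tfrac14 - x\right) = \tfrac34 - 2x
\;\;\Longrightarrow\;\;
Z^* = Z\!\left(\tfrac14\right) = \begin{pmatrix} \tfrac14 & \tfrac14 \\ 0 & \tfrac12 \end{pmatrix},
\qquad \langle C, Z^* \rangle = \tfrac14 .
\]
```

The rows of $Z^*$ sum to $a$ and its columns sum to $b$, so all the mass of $\alpha$ is moved onto $\beta$ at the lowest possible cost.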

Wasserstein distance
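The Wasserstein distance between $\alpha$ and $\beta$ is defined as $W(\alpha, \beta) = \langle C, Z^* \rangle$, but evaluating it is expensive in practice. Usually, an upper-bound approximation $W_\lambda(\alpha, \beta)$ of the Wasserstein distance is computed instead, defined as
$$W_\lambda(\alpha, \beta) = \langle C, Z^*_\lambda \rangle, \qquad Z^*_\lambda = \arg\min_{Z} \left\{ \langle C, Z \rangle - \lambda H(Z) \right\},$$
where the minimization is over the same set of transportation plans as for $Z^*$, $H(Z) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(Z_{ij}) \log p(Z_{ij})$ is the entropy function, which is used as a regularizer to improve the optimization result, and $\lambda > 0$ is the regularization weight. $p(Z_{ij})$ denotes the probability of passing $Z_{ij}$ units of mass from position $u_i$ to position $v_j$. $W_\lambda$ is evaluated using the Sinkhorn algorithm [30].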
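For reference, the standard Sinkhorn iterations for the entropy-regularized problem can be written as follows. This is a sketch of the textbook form of the algorithm rather than the exact variant used in the experiments. It writes $\lambda$ for the regularization weight and $a$, $b$ for the mass vectors of $\alpha$ and $\beta$ (symbols assumed here), takes $p(Z_{ij}) = Z_{ij}$, and uses $\oslash$ for element-wise division.

```latex
\[
\begin{aligned}
K_{ij} &= \exp\!\left(-C_{ij} / \lambda\right), \qquad u^{(0)} = \mathbf{1}_n, \\
v^{(k+1)} &= b \oslash \left(K^{\top} u^{(k)}\right), \qquad
u^{(k+1)} = a \oslash \left(K v^{(k+1)}\right), \\
Z_\lambda &= \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad
W_\lambda(\alpha, \beta) \approx \langle C, Z_\lambda \rangle .
\end{aligned}
\]
```

Each iteration only rescales the rows and columns of $K$ so that the marginals of $Z_\lambda$ match $a$ and $b$, which is why the procedure is cheap and differentiable.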

Wasserstein distance between speech and text
For the two independent distributions of speech and text sequences, we can use OT to measure the distance between them. Let the speech sequence be $H^s = (h^s_1, \ldots, h^s_i, \ldots, h^s_n)$ and the text sequence be $H^x = (h^x_1, \ldots, h^x_j, \ldots, h^x_m)$. Define two distributions $\alpha$ and $\beta$ whose mass is uniformly distributed at positions $h^s_1, \ldots, h^s_n \in \mathbb{R}^d$ and $h^x_1, \ldots, h^x_m \in \mathbb{R}^d$, respectively; that is, the mass at every position of $\alpha$ is $\frac{1}{n}$, and the mass at every position of $\beta$ is $\frac{1}{m}$. Let the transportation cost of a unit mass from $h^s_i$ to $h^x_j$ be $C(h^s_i, h^x_j) = \| h^s_i - h^x_j \|_p$, with $p \ge 1$ (typically $p = 2$). $\mathcal{L}_{ot} = W(\alpha, \beta)$ can be seen as the difference between the speech and text sequences; we call this value the Wasserstein distance, and it is added as a loss to the model training loss function.
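The computation described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ot_loss`, the regularization value `reg`, and the iteration count are hypothetical, and the entropy-regularized distance is approximated with plain Sinkhorn scaling iterations.

```python
import numpy as np


def ot_loss(H_s, H_x, reg=0.1, n_iters=200, p=2):
    """Approximate Wasserstein distance between a speech sequence H_s (n x d)
    and a text sequence H_x (m x d) with uniform masses 1/n and 1/m.

    reg and n_iters are illustrative values, not taken from the paper.
    """
    n, m = H_s.shape[0], H_x.shape[0]
    a = np.full(n, 1.0 / n)  # uniform mass on speech positions
    b = np.full(m, 1.0 / m)  # uniform mass on text positions

    # Pairwise cost C[i, j] = || h^s_i - h^x_j ||_p
    diff = H_s[:, None, :] - H_x[None, :, :]
    C = np.linalg.norm(diff, ord=p, axis=-1)

    # Sinkhorn scaling iterations on the kernel K = exp(-C / reg)
    K = np.exp(-C / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)

    Z = u[:, None] * K * v[None, :]  # approximate transport plan
    return float(np.sum(Z * C))      # <C, Z>


# Usage: a 6-frame "speech" sequence and a 4-token "text" sequence in R^8.
rng = np.random.default_rng(0)
speech = rng.normal(size=(6, 8))
text = rng.normal(size=(4, 8))
loss = ot_loss(speech, text, reg=1.0)
```

With very large costs or a very small `reg`, the kernel `K` can underflow; a log-domain formulation of the same iterations is the usual remedy. In training, the same computation would be expressed in an automatic-differentiation framework so that the loss can be backpropagated.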

Adaptive cross-entropy loss
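The multi-task cross-entropy loss is
$$\mathcal{L}_{multi\text{-}ce} = \omega_1 \mathcal{L}_{ST} + \omega_2 \mathcal{L}_{ASR} + \omega_3 \mathcal{L}_{MT},$$
where $\mathcal{L}_{ST}$, $\mathcal{L}_{ASR}$, and $\mathcal{L}_{MT}$ are cross-entropy losses on $\langle s, y \rangle$, $\langle s, x \rangle$, and $\langle x, y \rangle$ pairs. The weights $\omega_1$, $\omega_2$, and $\omega_3$ determine the extent to which the model is updated for each task during training. The cross-entropy loss functions for the ST, ASR, and MT tasks are as follows:
$$\mathcal{L}_{ST} = -\sum_{(s, y) \in D} \log P(y \mid s), \qquad \mathcal{L}_{ASR} = -\sum_{(s, x) \in D} \log P(x \mid s), \qquad \mathcal{L}_{MT} = -\sum_{(x, y) \in D} \log P(y \mid x).$$
To allocate task weights more effectively and optimize the model's performance on the target task, the weight $\omega_i$ at training step $t$ is determined by the proportion of the corresponding loss value at training step $t - 1$ to the total loss value. We express the weight as:
$$\omega_i(t) = \frac{\mathcal{L}_i(t-1)}{\mathcal{L}_{ST}(t-1) + \mathcal{L}_{ASR}(t-1) + \mathcal{L}_{MT}(t-1)},$$
where $\mathcal{L}_1 = \mathcal{L}_{ST}$, $\mathcal{L}_2 = \mathcal{L}_{ASR}$, and $\mathcal{L}_3 = \mathcal{L}_{MT}$. Therefore, the model can dynamically adapt its learning strategy according to the learning level of each task, thereby finding a combination of task weights that balances multi-task learning. Finally, the weight update scheme based on the loss proportion is substituted into the multi-task cross-entropy loss function, and the loss at training step $t$ is obtained as:
$$\mathcal{L}_{multi\text{-}ce}(t) = \omega_1(t)\, \mathcal{L}_{ST}(t) + \omega_2(t)\, \mathcal{L}_{ASR}(t) + \omega_3(t)\, \mathcal{L}_{MT}(t).$$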
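The loss-proportion weighting described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the choice of equal weights at the first step (when no previous losses exist) is an assumption, since the text does not say how the weights are initialized.

```python
def adaptive_weights(prev_losses=None, tasks=("st", "asr", "mt")):
    """Weights for step t from the task losses observed at step t - 1.

    Assumption: with no previous losses (first step), use equal weights.
    """
    if not prev_losses:
        return {task: 1.0 / len(tasks) for task in tasks}
    total = sum(prev_losses.values())
    return {task: loss / total for task, loss in prev_losses.items()}


def adaptive_multitask_loss(curr_losses, prev_losses=None):
    """Weighted sum of the current task losses with loss-proportion weights."""
    weights = adaptive_weights(prev_losses, tasks=tuple(curr_losses))
    return sum(weights[task] * curr_losses[task] for task in curr_losses)


# Usage: the task with the largest previous loss receives the largest weight.
prev = {"st": 2.0, "asr": 1.0, "mt": 1.0}
curr = {"st": 1.8, "asr": 0.9, "mt": 1.0}
print(adaptive_weights(prev))               # {'st': 0.5, 'asr': 0.25, 'mt': 0.25}
total_loss = adaptive_multitask_loss(curr, prev)  # 0.5*1.8 + 0.25*0.9 + 0.25*1.0
```

In an actual training loop the previous-step losses would typically be plain numbers (detached from the computation graph), so that the weights act as constants when gradients are computed; this detail is also an assumption here.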

ST datasets
We evaluate the methods presented in this paper on the Tibetan-Chinese (Ti-Zh), English-German (En-De), and English-French (En-Fr) directions. The Ti-Zh dataset was constructed from the TIBMD@MUC [31] dataset. For the En-De and En-Fr directions, we used the MuST-C [32] dataset from TED Talks. The detailed statistics of the datasets are shown in Table 1.

External MT datasets
We also introduce external MT datasets to pre-train our translation model. For the En-De and En-Fr directions, we randomly selected 2 million and 250,000 bilingual parallel sentences from the WMT [33] dataset, respectively. For the Ti-Zh direction, we collated 270,000 bilingual parallel sentences based on the TIBMD@MUC dataset.

Experimental setups

Model configuration
Our implementation is based on the FAIRSEQ toolkit [34]. Following standard practice in ST, we employ the Wav2vec 2.0 model followed by a subsampler as the speech encoder. The subsampler consists of two convolutional layers with a stride of 2, a kernel size of 5, and an output channel size of 512, aimed at reducing the length of the speech sequence and alleviating the length discrepancy between speech and text embeddings. The dimensionality of the text embedding layer is set to 512. For the Transformer, we adopt a base configuration with 6 layers for both the encoder and the decoder, each layer comprising 512 hidden units, 8 attention heads, and a feed-forward network (FFN) hidden size of 2048.

Data preprocessing
For the speech input, we use 16-bit, 16 kHz mono-channel raw audio. To ensure training efficiency, we filter out samples with more than 480k or fewer than 1k frames. As for the text input, we tokenize transcripts and translations using a SentencePiece [35] model. A vocabulary of size 10k is shared between the source and target languages. For external MT datasets, parallel sentence pairs with length ratios exceeding 1.5 are filtered out.
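The two filtering rules above can be expressed as simple predicates. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the length ratio is measured in tokens and applied symmetrically (longer side divided by shorter side), which the text does not specify.

```python
def keep_audio(num_frames, min_frames=1_000, max_frames=480_000):
    """Keep a speech sample unless it has fewer than 1k or more than 480k frames."""
    return min_frames <= num_frames <= max_frames


def keep_pair(src_tokens, tgt_tokens, max_ratio=1.5):
    """Keep a parallel sentence pair unless its length ratio exceeds max_ratio.

    Assumption: lengths are token counts and the ratio is longer / shorter.
    """
    len_src, len_tgt = len(src_tokens), len(tgt_tokens)
    if len_src == 0 or len_tgt == 0:
        return False  # empty sides are dropped
    return max(len_src, len_tgt) / min(len_src, len_tgt) <= max_ratio


# Usage on toy data.
print(keep_audio(160_000))                          # True  (10 s at 16 kHz)
print(keep_audio(500))                              # False (too short)
print(keep_pair("a b c".split(), "x y".split()))    # True  (ratio 1.5)
print(keep_pair("a b c d".split(), "x y".split()))  # False (ratio 2.0)
```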

Experimental details
During the training phase, we employ the Adam optimizer [36] to update parameters, with an initial learning rate of $2 \times 10^{-4}$, 15k warm-up steps, and a dropout of 0.1. In the inference phase, we use beam search with a beam size of 10. We evaluate translation quality on the test set with BLEU, computed using sacreBLEU [37]. All models are trained on Nvidia V100 GPUs.

Results
Table 2 shows the BLEU scores of each model. Compared to the base model, adaptive-OTST improves by 1.16, 0.41, and 1.19 BLEU in the Ti-Zh, En-De, and En-Fr directions, respectively. We also compare our approach with other baseline models, including XSTNet [26], which uses a progressive training procedure, ConST [9], which uses a contrastive learning strategy, and STEMM [24], which mixes speech representation sequences and word embedding sequences. As most existing performance improvements rely on large-scale external MT data, for a fair comparison we study two settings: (1) without external MT data and (2) with external MT data. Without external MT data, our method improves over the base model by 2.01 BLEU on average across the three directions. With external MT data, adaptive-OTST also surpasses the other strong baselines.
To validate our method under extremely low-resource settings, we constructed 10-hour ST subsets by random sampling from the TIBMD@MUC Tibetan-Chinese dataset and the MuST-C dataset, respectively. In this extremely low-resource ST setting, we compared our method with other models, with results shown in Table 3. Adaptive-OTST consistently outperforms the baseline methods in all three language directions.

Ablation study
Because our ST system is a multi-task learning framework, its performance is influenced by the training objectives. Through an ablation study, we evaluate the impact of the multi-task learning modules and the OT method on model performance. Table 4 shows the performance of the model under different training objectives. The experimental results indicate that, for the Ti-Zh translation direction, both the multi-task learning module and the OT method contribute to the improvement of model performance. Based on the results of Exp I, Exp II, and Exp IV, we conclude that multi-task learning brings an improvement of 2.37 BLEU, and that removing the adaptive cross-entropy loss from the multi-task dynamic weight adjustment method decreases translation performance. The results of Exp I and Exp III indicate that introducing the OT loss yields a significant performance advantage in ST systems.

Comparison between OT and other losses
In this paper, we reduce the distance between speech and text representations by introducing the OT method. To demonstrate the effectiveness of the OT method, we compare it with cross-attentive regularization (CAR) [38] applied at the input layer of the Transformer encoder.
Due to the distinct input modalities of speech and text, their representations may have different lengths and cannot be directly compared. Hence, we first reconstruct the speech feature sequence from the output of the speech encoder and the text feature sequence from the output of the text embedding layer. The two reconstructed sequences are calculated from the text output sequence via self-attention or the speech output sequence via cross attention over the text output sequence. Both reconstructed sequences have the same length, and the similarity between the speech and text feature sequences can be measured by the L2 distance between these two reconstructed sequences, where a smaller distance indicates higher similarity between speech and text. Table 5 shows the BLEU scores of the models under the different methods, with the OT loss yielding a BLEU score 0.88 higher than the CAR loss.

Positions of OT
In speech translation, speech features can be roughly divided into acoustic features and semantic features. After the speech signal is processed by a speech encoder, the number of acoustic and semantic features in the low-level speech representation is roughly equivalent. However, in the high-level speech representation output by the Transformer encoder, semantic features usually dominate. The modal differences between these two layers of speech representation and their corresponding text representations are apparent. However, there is currently no consensus on which layer's modal difference reduction would have a more significant impact on the performance of ST models. To this end, we applied OT at the input and at the output layer of the Transformer encoder to reduce modal differences. As shown in Table 6, introducing OT at the input layer of the Transformer encoder results in better model performance than introducing it at the output layer. We believe that the lower layers of the model retain more of the original alignment information, which is more suitable for OT calculation.

Weight setting for OT loss
In this section, we discuss the impact of the OT loss weight. We experimented with several values ranging from 0.1 to 1.0. Figure 2 shows how the model's BLEU score varies with the weight, with the highest BLEU achieved at a weight of 0.25. When the OT loss weight is too small, its effect on reducing the modality gap is minimal, so the model cannot effectively leverage MT to improve ST performance. Conversely, when the weight is too large, the model focuses excessively on narrowing the modality gap between speech and text, leading to a decline in the performance of the primary ST task. Therefore, we opted for a moderate setting, selecting an OT loss weight of 0.25 to achieve the best model performance.

Conclusion
In this paper, we propose adaptive-OTST, which uses an adaptive cross-entropy loss function based on task loss proportion as the multi-task objective function to improve the adaptive ability of the multi-task ST model. In addition, it reduces the modality gap by bringing speech and text representations closer together in the Wasserstein space, leading to better performance. The experiments demonstrate the efficacy of our approach in low-resource ST. In the future, we hope to integrate optimal transport with other methods to bridge the modality gap and further improve the performance of ST.

Table 1
Statistics of the dataset

Table 2
BLEU scores of different models

Table 3
Results of each model under extremely low-resource settings

Table 4
Ablation study in Ti-Zh direction