Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition

Due to the dynamic nature of human language, automatic speech recognition (ASR) systems need to continuously acquire new vocabulary. Out-of-vocabulary (OOV) words, such as trending words and new named entities, pose problems for modern ASR systems, which require long training times to adapt their large numbers of parameters. Different from most previous research, which focuses on language model post-processing, we tackle this problem at an earlier processing level and eliminate the bias in acoustic modeling so that OOV words can be recognized acoustically. We propose to generate OOV words using text-to-speech systems and to rescale losses to encourage neural networks to pay more attention to OOV words. Specifically, when fine-tuning a previously trained model on synthetic audio, we either enlarge the classification loss of utterances containing OOV words (sentence level) or rescale the gradients used in back-propagation for the OOV words themselves (word level). To overcome catastrophic forgetting, we also explore combining loss rescaling with model regularization, i.e. L2 regularization and elastic weight consolidation (EWC). Compared with previous methods that simply fine-tune on synthetic audio with EWC, experimental results on the LibriSpeech benchmark reveal that our proposed loss rescaling approach achieves a significant improvement in recall rate with only a slight increase in word error rate. Moreover, word-level rescaling is more stable than utterance-level rescaling and leads to higher recall rates and precision on OOV word recognition. Furthermore, our proposed combination of loss rescaling and weight consolidation can support continual learning of an ASR system.


Introduction
Recently, end-to-end ASR models have been receiving a lot of attention and achieving impressive performance [1,2,3]. These models significantly simplify the training process to directly map acoustic inputs to characters or words.
Additionally, limited domain-specific knowledge is required, which dramatically accelerates model development and deployment. However, end-to-end models need a lot of training data and perform poorly on out-of-vocabulary (OOV) words or words rarely seen in the training data, for example, trending words and new named entities.
Since it takes substantial effort to collect labeled OOV speech data for ASR model training, current approaches to tackle the OOV problem mainly involve a language model (LM) or post-processing, for instance, user-dependent language models [4,5], LM rescoring [6] and finite-state transducer lattice extension [7].
However, the post-processing techniques only obtain limited improvement as they do not tackle the root causes at the acoustic level.
Alternatively, fine-tuning end-to-end ASR models with synthetic audio containing OOV words can efficiently improve the recall rate of unseen vocabulary; this usually leverages advanced text-to-speech (TTS) systems to generate the audio-text pairs required for ASR model training. However, the catastrophic forgetting problem substantially degrades the overall performance of ASR systems, especially on non-OOV words. Elastic weight consolidation (EWC) [8] has been adapted to tackle this problem but leads to only limited recall rate improvement for OOV word recognition. In this paper, we take this method a step further and propose loss rescaling to encourage models to pay more attention to unknown words. Instead of simply fine-tuning ASR models where all words are treated equally, enlarging the loss of utterances containing OOV words (sentence level) or increasing the gradient of unseen words (word level) can efficiently steer the model toward updating the weights related to OOV words. We choose 100 OOV words that appear in the LRS3-TED dataset but not in the LibriSpeech dataset. Then, we crawl texts including the new words from the Internet and synthesize audio with TTS systems. The experimental results of fine-tuning on these audio-text pairs with a hybrid Connectionist Temporal Classification (CTC)/attention ASR model show a significant improvement in recall.
When combining EWC with word-level loss rescaling, we achieve a recall of 45.81% on the ROOV test set with only 7.8% and 4.6% relative WER increases on the LibriSpeech test-clean and test-other sets respectively. As a result, we improve the recognition of OOV words while maintaining the accuracy on non-OOV words.

The Recognition of OOV Words in End-to-End ASR Models
Since the OOV problem first arose, a number of approaches have been proposed for conventional Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems [9,10] and hybrid Deep Neural Network (DNN)-HMM systems [11,12]. In this section, we only review methods for end-to-end ASR architectures, which have become the most promising methods in speech recognition in recent years.
Aleksic et al. [13] extend class-based LMs [4,5] by creating a small user-dependent LM for contact name recognition in voice commands, which is compiled dynamically based on the contact names on users' devices. Moreover, a contact insertion reward is proposed to avoid excessive bias and to balance the information between user-dependent and user-independent cases. Hori et al. [14] combine word-level with character-level language modeling in end-to-end architectures.
Different from most of the previous work focusing on LM post-processing, which requires candidate units to exist in n-best lists or decoding lattices, in this paper we tackle the OOV problem at an earlier processing level by eliminating the bias in acoustic modeling to recognize OOV words acoustically.

Data Augmentation with Synthetic Audio for ASR
Proper speech data augmentation not only boosts model performance but can also significantly improve system robustness and generalization [22]. There are many strategies used in ASR training, for example, noise addition, pitch shifting, speed perturbation, back-translation [23] and room impulse response injection with real or simulated data [24]. More recently, a simple yet effective approach, SpecAugment [25], has been proposed and achieves state-of-the-art results on the LibriSpeech benchmark corpus. The basic idea of SpecAugment is randomly masking or cropping a fixed area of the spectrogram in the time or frequency domain, which effectively prevents model overfitting, especially under noisy conditions. Another well-established method is mixing synthetic audio with real data by leveraging advanced TTS models, like Tacotron2 [26], DeepVoice3 [27] and FlowTron [28].
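The masking idea behind SpecAugment can be sketched in a few lines. The snippet below is a minimal illustration, not the reference implementation: it zeroes out one random time band and one random frequency band of a spectrogram represented as a plain list of frames; the mask-width bounds and function name are our own assumptions.

```python
import random

def spec_augment(spec, max_t=10, max_f=8, seed=0):
    """Zero out one random time band and one random frequency band of a
    spectrogram, given as a list of frames (each a list of frequency bins).
    max_t / max_f bound the mask widths; all names here are illustrative."""
    rng = random.Random(seed)
    T, F = len(spec), len(spec[0])
    t0, tw = rng.randrange(T), rng.randrange(1, max_t + 1)
    f0, fw = rng.randrange(F), rng.randrange(1, max_f + 1)
    out = [row[:] for row in spec]          # do not mutate the input
    for t in range(t0, min(T, t0 + tw)):    # time mask
        out[t] = [0.0] * F
    for row in out:                         # frequency mask
        for f in range(f0, min(F, f0 + fw)):
            row[f] = 0.0
    return out
```

In practice multiple masks are drawn per utterance and applied on log-mel features inside the training pipeline.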

CTC-based End-to-End Models
The basic idea of CTC-based approaches is to introduce a special token, 'blank', which is dynamically filled into the places between modeling units. Consequently, the exact boundary information required by conventional methods is no longer needed. Then, a carefully designed dynamic programming algorithm is used to search for optimal paths and convert the frame-level token sequences into meaningful utterances by removing blank tokens and merging repeated labels.
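The collapse step described above (merge repeats, then drop blanks) can be sketched as follows; the function name and blank symbol are illustrative choices.

```python
def ctc_collapse(frame_tokens, blank="_"):
    """Collapse a frame-level CTC path into an output sequence:
    merge consecutive repeats first, then drop blank tokens.
    Doing it in this order lets a blank separate genuine repeats."""
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out
```

For example, the path `a a _ b _ b b` collapses to `a b b`: the blank between the two `b` runs is what allows a repeated label to survive in the output.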
Graves et al. [1] first introduced the ground-breaking CTC approach, which overcomes the need for frame-level alignments between the audio and the transcription.

Attention-based Encoder-Decoder Models
Another branch of end-to-end systems is the attention-based encoder-decoder architecture. Different from CTC or RNN-T architectures, the attention mechanism dynamically aligns the encoder and decoder time steps to temporally align the input and output sequences. Chorowski et al. [44] transfer attention-based recurrent networks to speech recognition and achieve results on the TIMIT benchmark that compare well with conventional methods. Chan et al. [45] propose LAS (listen, attend and spell) to directly map acoustic inputs to transcriptions. The results show that the attention mechanism prevents model overfitting on the training set. Meanwhile, Bahdanau et al. [46] show that attention-based models can implicitly learn better context information than CTC and conventional models. In addition, local monotonic attention [47], full-sequence attention [48], time-restricted self-attention [49], multi-channel attention [50] and online attention [51] have been proposed to reduce the complexity of attention computation and learn more robust alignments.

Hybrid CTC/Attention Architectures
To fully incorporate the merits of CTC and attention models, Kim et al. [52] propose to jointly train CTC and attention-based approaches in a multi-task learning fashion by sharing one encoder. The evaluation on WSJ and CHiME-4 noisy speech shows the hybrid architecture can efficiently speed up convergence and learn more robust alignment between input frames and output sequences.
Hori et al. [53] extend the hybrid CTC/attention method with a joint decoding algorithm by rescoring or combining the probabilities from both objective functions. Then, monotonic chunk-wise attention [54] and a transformer-based encoder [55] are utilized to enable the hybrid CTC/attention model to work in online streaming tasks. Zhang et al. [56] propose a new two-pass approach (U2) which unifies streaming and non-streaming ASR models in one architecture. The hybrid CTC/attention architecture is becoming increasingly popular, with a trend toward unifying the streaming and non-streaming cases in a single model. However, limited research has focused on the recognition of OOV words in end-to-end architectures. In this paper, we tackle the problem of OOV words with the U2 ASR model, which is described in more detail in Section 4.1.

Methodology
In this section, we present the proposed loss rescaling approaches at the sentence level and the word level. Furthermore, we introduce the L2 regularization and EWC techniques used to overcome the catastrophic forgetting problem.

Loss Rescaling at Sentence Level
During training, the CTC function returns one loss per utterance, and the mean of all utterance losses in the same mini-batch is used for back-propagation. As shown in Figure 1 (a), each bar represents the loss of one utterance in a randomly selected mini-batch. We observe that the utterance losses within one mini-batch are evenly distributed; sometimes, the loss of utterances containing OOV words can be only slightly higher, or even lower, than that of utterances without OOV words. Consequently, the model pays equal attention to each utterance or word, so the final model performance relies heavily on the frequency of words in the training set. Sometimes, the model's attention is even biased towards non-OOV words.
To emphasize an utterance containing OOV words, we rescale the utterance loss, as indicated in Figure 1.
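The sentence-level idea can be sketched in a few lines. This is a minimal illustration under our own naming assumptions: the per-utterance losses would come from a CTC loss function, and µ is the rescaling weight examined in the experiments.

```python
def rescale_utterance_losses(losses, contains_oov, mu):
    """Sentence-level loss rescaling sketch: enlarge the loss of every
    utterance that contains an OOV word by a factor mu, then take the
    mini-batch mean that is used for back-propagation."""
    scaled = [loss * mu if oov else loss
              for loss, oov in zip(losses, contains_oov)]
    return sum(scaled) / len(scaled)
```

With µ = 1 this reduces to the ordinary mini-batch mean, so the baseline fine-tuning setup is a special case of the rescaled one.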

Loss Rescaling at Word Level
Given an input acoustic vector sequence x = (x_0, ..., x_T) and a target label sequence y = (y_0, ..., y_U), where T ≫ U and T and U are the lengths of the acoustic sequence and the target label sequence respectively, the CTC loss aims to maximize the log probability of y given x.
After processing the last input vector x_T, we obtain a (2U + 1) × T lattice matrix; Figure 2 shows, for instance, the lattice for the utterance "News about Brexit" modeled on subword units. We denote by y(t, u) and b(t, u) the probabilities of a label and of a blank token at node (t, u) respectively. According to the definition of CTC [57] and the three cases shown in Figure 4, the forward variable α(t, u) can be calculated recursively: a label node that differs from the label two positions earlier sums α(t − 1, u), α(t − 1, u − 1) and α(t − 1, u − 2) before multiplying by y(t, u), while a blank node or a label node repeating the label at position u − 2 sums only α(t − 1, u) and α(t − 1, u − 1). It is worth noting that when two adjacent tokens are the same, i.e. the token at position u equals the token at position u − 2, there is no direct transition between the two repeated tokens, and the second one can only be reached through the blank token or the node (t − 1, u).
Similarly, following the blue arrows reaching node (t, u), the backward variable β(t, u) can be represented by Eq. (4) in three analogous cases.
Thus, the gradients of the CTC loss function L_CTC w.r.t. y(t, u) and b(t, u) can be estimated by Eq. (6) and Eq. (7) respectively.
The CTC function treats all nodes equally and aims to minimize the global loss, which makes it hard for models to focus on local connections in the decoding lattice. To guide models to pay more attention to OOV words, we emphasize them (the nodes in the dotted box in Figure 2) by rescaling the probabilities of OOV nodes in the candidate alignments, which yields a rescaled probability for all alignments passing through OOV nodes and a regularized loss function at the word level. We implement our approach by multiplying the gradients of OOV nodes on the candidate paths by µ, where O denotes the set of OOV words tokenized into subwords.
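To make the lattice recursion concrete, the sketch below implements the standard CTC forward pass over the (2U + 1)-row lattice described above. It assumes per-frame token probabilities are given as dictionaries and works in plain probability space (a real implementation would work in log space for stability); word-level rescaling would additionally multiply the gradients at the OOV nodes by µ.

```python
BLANK = "_"

def extend_with_blanks(labels):
    """Interleave blanks around the target labels: U labels -> 2U + 1 rows."""
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    return ext

def ctc_forward(probs, labels):
    """Total probability of `labels` under CTC, via the forward variables.
    probs: list over time of dicts mapping token -> probability."""
    ext = extend_with_blanks(labels)
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][BLANK]           # start at blank ...
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]      # ... or at the first label
    for t in range(1, T):
        for u in range(S):
            a = alpha[t - 1][u]             # stay on the same node
            if u > 0:
                a += alpha[t - 1][u - 1]    # advance one node
            # skip the blank only between two *different* labels
            if u > 1 and ext[u] != BLANK and ext[u] != ext[u - 2]:
                a += alpha[t - 1][u - 2]
            alpha[t][u] = a * probs[t][ext[u]]
    # valid paths end on the last label or the final blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

With two frames, one label "a" and uniform probabilities 0.5 for "a" and blank, the three valid paths (`aa`, `_a`, `a_`) each have probability 0.25, so the total is 0.75.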

Overcoming Catastrophic Forgetting
Directly fine-tuning models on a dataset that follows a different distribution from the original training set may lead to catastrophic forgetting: the updated model may overfit the new dataset but forget the knowledge learned on the original one. To prevent models from suffering catastrophic forgetting, we adopt two approaches during fine-tuning. The first is mixing part of the original LibriSpeech audio used for baseline model training with the synthetic speech, since adding data that follows the same distribution as the training set can efficiently mitigate the forgetting problem. We explore the effect of different mixing ratios and present the results in Section 5. The second is constraining model parameters from updating during fine-tuning with L2 regularization or EWC [8], whose details we introduce in the following sections.

L2 Regularization
The L2 regularization loss L_L2(θ) is shown in Eq. (12), where L(θ) is the original CTC loss or the rescaled loss in Eq. (1) and Eq. (9). θ_i is the i-th parameter of the ASR model being updated during fine-tuning, and θ*_i is the i-th parameter of the baseline model, which is fixed and stored locally. λ is the coefficient that balances the scale of the two parts.
The L2 loss takes the difference between the fine-tuned model and the old model into account to ensure the updated model does not stray too far from the baseline.
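A minimal sketch of this penalty, consistent with the description of Eq. (12) (parameters are flattened into plain lists here for clarity):

```python
def l2_regularized_loss(asr_loss, params, baseline_params, lam):
    """ASR loss plus a weighted squared distance between the current
    parameters and the frozen baseline parameters (L2 regularization)."""
    penalty = sum((p - p0) ** 2
                  for p, p0 in zip(params, baseline_params))
    return asr_loss + lam * penalty
```

The penalty is zero when the fine-tuned model equals the baseline, and grows quadratically as parameters drift, which is what keeps the model close to its pre-trained state.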

Elastic Weight Consolidation
Different from the L2 loss, which always refers to a fixed standard and treats all parameters equally, the EWC loss shown in Eq. (13) uses the diagonal of the Fisher information matrix F to dynamically weigh the importance of each model parameter for the source task.
The Fisher information matrix F can be estimated from the gradients of the converged source model, where θ and D are the parameters and the dataset used in the source ASR task.
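The two steps above can be sketched as follows. This is an illustrative simplification (parameters and gradients as flat lists, a plain mean of squared per-sample gradients for the Fisher diagonal), not the exact formulation of Eq. (13):

```python
def fisher_diagonal(per_sample_grads):
    """Estimate the diagonal of the Fisher information matrix as the mean
    squared gradient over samples from the source dataset."""
    n = len(per_sample_grads)
    return [sum(g[i] ** 2 for g in per_sample_grads) / n
            for i in range(len(per_sample_grads[0]))]

def ewc_loss(asr_loss, params, baseline_params, fisher, lam):
    """EWC-style penalty: squared parameter drift weighted per parameter
    by the Fisher diagonal, so weights important for the source task
    are constrained more strongly than unimportant ones."""
    penalty = sum(f * (p - p0) ** 2
                  for f, p, p0 in zip(fisher, params, baseline_params))
    return asr_loss + lam * penalty
```

Parameters with a large Fisher value (high curvature on the source task) are expensive to move, while parameters with a near-zero Fisher value remain free to adapt to the OOV data.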

ASR Model Architecture
The end-to-end ASR model used in our experiments is the two-pass hybrid CTC/attention architecture, U2 [56], as shown in Figure 5. The shared encoder converts acoustic features x into a latent vector h_enc; the CTC decoder then transforms the latent vector into character/word probabilities P(y_t|x_t) with the same length as the input frames. Meanwhile, the attention decoder generates one character/word probability P(y_u|y_{u-1}, ..., y_0, x) per time step by conditioning on the attention context vector c_u and the decoder output from the last step y_{u-1}. During training, the sum of the CTC loss and the attention loss is used for back-propagation, while during inference, the n-best hypotheses produced by the CTC decoder are rescored by the attention decoder to obtain better performance. The candidate with the highest score becomes the final output.
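The second-pass rescoring step can be sketched as below. The interpolation weight and function names are our own illustrative assumptions; the actual U2 implementation combines scores inside the decoder.

```python
def rescore_nbest(nbest, attention_score, ctc_weight=0.5):
    """Two-pass rescoring sketch: combine each CTC n-best hypothesis score
    with an attention-decoder score and return the best hypothesis.
    nbest: list of (hypothesis, ctc_score) pairs.
    attention_score: function mapping a hypothesis to a score."""
    def combined(item):
        hyp, ctc_score = item
        return ctc_weight * ctc_score + (1 - ctc_weight) * attention_score(hyp)
    return max(nbest, key=combined)[0]
```

Note that the attention decoder only scores complete hypotheses here; it never has to run an autoregressive beam search of its own, which is what makes the two-pass design fast.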

OOV Set with Real Audio
We build a 100-OOV-word dataset from the LRS3-TED [59] corpus, since there are no standard OOV corpora published by the community. LRS3-TED is an audio-visual dataset collected from TED and TEDx talks with spontaneous speech and various speaking styles. It comprises over 400 hours of video from more than 5000 speakers and contains an extensive vocabulary. We filter for words that occur in LRS3-TED but not in LibriSpeech and select 100 OOV words from more than 100 speakers, where each OOV word is covered by 50 utterances. We randomly split these utterances into training, validation and test sets with a ratio of 2:1:2; the corresponding durations are 3h, 1.6h and 2.8h. In the rest of this paper, the three sets are referred to as the real OOV (ROOV) training, ROOV val and ROOV test sets respectively. We report all experimental results on the ROOV test set.
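The 2:1:2 split described above can be sketched as a small helper (names and the fixed seed are illustrative; the paper does not specify its splitting code):

```python
import random

def split_utterances(utterances, ratios=(2, 1, 2), seed=0):
    """Randomly split utterances into train/val/test with a 2:1:2 ratio,
    mirroring the ROOV split described above."""
    shuffled = utterances[:]
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

For the 50 utterances per OOV word, this yields 20/10/20 utterances in the training, validation and test portions respectively.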
More details about the 100 OOV words can be found in the Appendix.

OOV Set with Synthetic Audio
Our goal in this paper is to improve OOV word recognition with synthetic audio and loss rescaling methods. In this subsection, we introduce the synthetic dataset used for model training.

Evaluation Metrics
We use three metrics to evaluate the experimental results of our proposed method:
• WER: the word error rate is the ratio of error terms, i.e., substitutions, deletions, and insertions, to the total number of words in the reference.
• Recall: the fraction of OOV word occurrences that are correctly recognized, Recall = TP / (TP + FN).
• Precision: the fraction of recognized OOV words that are correct, Precision = TP / (TP + FP). (16)
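The three metrics can be computed as follows; the WER function uses the standard word-level edit distance, and the precision/recall helper assumes TP/FP/FN counts have already been collected for the OOV words.

```python
def wer(reference, hypothesis):
    """Word error rate via word-level edit distance: (substitutions +
    deletions + insertions) divided by the reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                          # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                          # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub
                          d[i - 1][j] + 1,                           # del
                          d[i][j - 1] + 1)                           # ins
    return d[len(r)][len(h)] / len(r)

def precision_recall(tp, fp, fn):
    """OOV-word precision and recall from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```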

Training Settings
The baseline model is trained on the 960h LibriSpeech dataset with a batch size of 12, an initial learning rate of 4e-3 and 25000 warm-up steps. When fine-tuning for the OOV experiments, we use a bigger batch size of 20 to let the model see more utterances not containing OOV words and avoid loss explosion. A tiny initial learning rate of 4e-6 is used for fine-tuning, annealed by a factor of 1.1 after every 3000 steps, since a tiny learning rate can efficiently ensure stable model learning and retain the previously learned knowledge. In addition, to avoid gradient explosion, we clip all gradients greater than 2, while the threshold used in baseline training is 5.
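The fine-tuning schedule and clipping described above can be sketched as follows. The step-wise decay formula is our reading of "annealed by a factor of 1.1 every 3000 steps", and the clipping is shown element-wise for simplicity (clipping by global norm is also common):

```python
def learning_rate(step, base_lr=4e-6, decay=1.1, interval=3000):
    """Anneal the tiny initial learning rate by a factor of 1.1
    after every 3000 fine-tuning steps."""
    return base_lr / (decay ** (step // interval))

def clip_gradient(g, threshold=2.0):
    """Clip a single gradient value to [-threshold, threshold]
    (fine-tuning uses a threshold of 2, baseline training uses 5)."""
    return max(-threshold, min(threshold, g))
```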
The validation set during model training is a 1:1 mixture of the LibriSpeech dev set and the OOV TTS dev set. The model checkpoint performing best on this mixed validation set is used for evaluation on the test sets, with early stopping. It is noteworthy that the attention mechanism and attention decoder are always frozen, since we found that fine-tuning the entire network leads to gradient explosion. In addition, it is hard to balance the CTC loss and the attention loss, since the CTC loss is rescaled in our methods. We only use the attention decoder for rescoring.

Experimental Results
In this section, we report the experimental results from the following perspectives.

Results of Speech Mixture from Source and Target Domain
To mitigate catastrophic forgetting, we mix original real speech from the LibriSpeech training set (source domain) with the ROOV or the SOOV training set (target domain). It is still an open question what the best mixing ratio is.
We fine-tune the baseline model with different ratios and report results on the standard LibriSpeech test sets (test-clean and test-other) and the ROOV test set.
As shown in Table 1, when fine-tuning only with the ROOV training set (real speech from LRS3-TED), the model is unable to retain old knowledge and performs badly on the previous LibriSpeech tasks, with the WER on the test-other set rising dramatically from 8.72% to 41.52%. Context information is crucial for ASR models: in the 0:1 setting, the pre-trained ASR model is destroyed, especially its learned context knowledge, so the overfitted model can hardly infer the correct context and recognize OOV words. The forgetting tendency slows down as original data is incorporated into training. When the ratio of audio from LibriSpeech and LRS3-TED is 2:1, the model achieves the highest recall of 32.05%. The more data from the source task (LibriSpeech) is used for training, the more previous knowledge is retained and the better the performance on the LibriSpeech evaluation sets. However, the model tends to focus on the previous LibriSpeech tasks as the ratio increases, which leads to a decrease in recall on the ROOV test set.
We can draw the same conclusion when fine-tuning ASR models with synthetic data as shown in Table 2. We prioritize the model performance regarding recall since the goal of this paper is to enable the ASR model to learn new vocabulary, and the catastrophic forgetting issue will be tackled in the next section. Therefore, the 2:1 mixture ratio is used in the following experiments.

Results of Loss Rescaling at Sentence Level
In this section, we explore the effect of loss rescaling at the sentence level. We compare the model performance using real speech data with the results using synthetic audio. As shown in Table 3, using L2 and EWC regularization efficiently reduces catastrophic forgetting and improves the recall rate on OOV words, while the WER increases only in a few cases on the test-clean and test-other sets. We find λ = 5e7 is the best weight to balance the L2/EWC loss and the ASR losses.
We reproduce the method proposed by Zheng et al. [61], in which an RNN-T ASR model is fine-tuned with EWC on mixed real and synthetic audio. However, their dataset is not published, so the experimental results reported in Table 3 and Table 4 for the method of Zheng et al. [61] are based on our generated data. In addition, as shown in Table 3, row "Isolated Words", we de-emphasize the non-OOV words by fine-tuning ASR models with utterances containing only isolated OOV words, which are segmented from real or synthetic continuous speech according to the time alignment information obtained from the Montreal Forced Aligner. Compared with the method proposed by Zheng et al. [61], fine-tuning ASR models with only isolated OOV words can effectively improve the recall rate, but it leads to much more serious forgetting on non-OOV recognition. When simply fine-tuning the base model on real or synthetic OOV audio, all words are treated equally, so the model can hardly focus on the OOV words of concern. Therefore, we propose loss rescaling and encourage the model to pay more attention to OOV words by enlarging the loss of sentences containing unknown words. For the loss rescaling weight µ, we examine the values 1, 10, 100, 1000, and 10,000. As we can see in Table 3, the OOV recall increases rapidly when rescaling the target sentences by 100 times compared to only fine-tuning with L2 (50.02% vs. 24.13% for real speech and 31.26% vs. 24.13% for synthetic audio).
As a bigger µ is used, the recall rises further, but the WER on the test-clean and test-other sets worsens. We hypothesize that directly rescaling the entire sentence loss may also enhance irrelevant words or noise, which leads to gradient explosion during training and accelerates the forgetting of previous knowledge. Hence, we have to use a very small learning rate and clip gradients over 2.0 to keep fine-tuning stable. In contrast to L2 regularization, EWC can provide more stable and resilient protection of the weights important for the previous LibriSpeech tasks, but still with a relatively high loss in ASR performance, as shown in Table 3.
In addition, when fine-tuning only with synthetic audio, we obtain a competitive recall rate compared to utilizing real speech from the LRS3-TED dataset; for example, when using EWC and rescaling the loss by a factor of 1000, we achieve recall rates of 54.06% and 42.38% for real and synthetic speech respectively.
Furthermore, compared with the method proposed by Zheng et al. [61], rescaling the OOV utterance loss achieves a significant improvement in recall with only slight degradation in WER and precision.

Results of Loss Rescaling at Word Level
Instead of enhancing the entire sentence loss, in this section we report the results of rescaling only the unknown words. As shown in Table 4, the λ weight needed to balance the ASR and L2/EWC losses is smaller (1e7) than the one used at the sentence level (5e7) in Table 3, and we do not observe gradient explosion during training unless µ is very large, e.g. 1e4. The results without loss rescaling (µ = 1) are slightly different from those shown in Table 3. In addition, we obtain competitive performance by only using synthetic audio compared with using real speech data for fine-tuning.

Discussion
New vocabulary emerges all the time due to the evolution of human language.
Therefore, it is important to enable a trained ASR system to dynamically acquire unseen vocabulary. The combined loss rescaling and weight consolidation methods proposed in this paper can support continual learning [62] of an ASR system. The methods neither require any labeled data nor do they require retraining a new ASR model from scratch.
An interesting finding is that enhancing the gradient of the blank tokens within and after OOV words is important as well, as it encourages the decoding procedure to move forward, for instance at the rows of u_6, u_8 and u_10 in Figure 2.
Otherwise, the decoding progress is cut off and models repeatedly produce one token, such as "news about bre bre bre" when only enlarging the gradient of "bre" in the utterance "news about bre xi t". Sometimes, the fine-tuned ASR system even repeats a single token, for example "bre bre bre bre bre", when µ is very large.
Additionally, we find that the performance and the speed of convergence are affected by the batch size, especially for loss rescaling at the sentence level. When the batch size is small, e.g. 5, all utterances in one batch may contain OOV words, which leads to a larger rescaled loss. Consequently, the model suffers from gradient explosion, and L2 or EWC regularization can hardly constrain the model weights from diverging.

Conclusion
In this paper, we present the use of synthetic speech to boost an ASR model's recognition of OOV words. In addition to fine-tuning with audio containing OOV words, we propose rescaling the loss at the sentence level or the word level, which encourages models to pay more attention to unknown words. Experimental results reveal that fine-tuning the baseline ASR model combined with loss rescaling and L2/EWC regularization can significantly improve the OOV word recall rate and efficiently prevent models from suffering catastrophic forgetting. Furthermore, loss rescaling at the word level is more stable than at the sentence level and results in less ASR performance loss on general non-OOV words and the previous LibriSpeech tasks. The combination of the proposed loss rescaling, which updates the parameters related to the new task (OOV word recognition), and EWC, which retains the weights learned on the old task (speech recognition on the LibriSpeech dataset), can enable continual learning of an ASR system.
The proposed target-word loss rescaling method is simple and effective, but some issues remain. Currently, results are evaluated on synthetic audio, which differs from spontaneous speech recorded in the real world. Future work could focus on collecting real-scenario speech and evaluating models on it, which would enable us to better understand and compare the contributions of synthetic and real speech containing OOV words. Additionally, the current OOV word set needs to be known in advance; how to automatically detect and acquire OOV words is a potential direction. The trade-off between WER on general test sets (e.g. the LibriSpeech test-clean and test-other sets) and recall rate on the OOV set is another issue: a fixed regularization weight may hinder model updating later in fine-tuning, so a dynamic L2/EWC weight [63] could be adopted to replace the fixed λ. Moreover, we are interested in investigating the effectiveness of our proposed method on RNN-T and attention-based encoder-decoder ASR systems. It is also worthwhile to explore our loss rescaling method on general unbalanced-label problems, for example speaker diarization and voice verification. Since continual learning in sequence processing is a young research field [64,65,66], our loss rescaling method may have wider implications for data where novel elements are learned in a temporal or spatial context together with known elements.