Recognition of target domain Japanese speech using language model replacement

End-to-end (E2E) automatic speech recognition (ASR) models, which consist of deep learning models, are able to perform ASR tasks using a single neural network. These models should be trained using a large amount of data; however, collecting speech data that matches the target speech domain can be difficult, so speech data that is not an exact match to the target domain is often used, resulting in lower performance. In comparison to speech data, in-domain text data is much easier to obtain. Traditional ASR systems exploit this by using separately trained language models and HMM-based acoustic models. However, it is difficult to separate language information from an E2E ASR model because the model learns both acoustic and language information in an integrated manner, which makes it very difficult to create E2E ASR models for a specialized target domain that achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within pre-trained E2E ASR models in order to achieve adaptation to a target domain. The “implicit” language information contained within the ASR model is removed by subtracting, in the logarithmic domain, the score of a source-domain language model trained on the transcriptions of the ASR model’s training data. We then integrate a target-domain language model through addition in the logarithmic domain. This subtraction and addition, which together replace the language model, is based on Bayes’ theorem. In our experiments, we first used two datasets of the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domains, respectively, to test the effectiveness of our proposed method at bridging the gap between these two language domains.
Our results show that our proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.


Introduction
Traditional automatic speech recognition (ASR) systems, such as those based on Gaussian mixture model HMM (GMM-HMM) or deep neural network HMM (DNN-HMM), are very complex, consisting of various modules such as acoustic models, dictionaries, and language models [1,2]. On the other hand, end-to-end (E2E) ASR models, which use deep learning, can represent these complex speech and language processes using a single neural network. A wide variety of E2E ASR models have been proposed over the past few years, such as those based on long short-term memory (LSTM) [3,4] and on Transformer [5][6][7][8] with an attention mechanism [9], which have been used with great success in the field of natural language processing (NLP) for tasks such as machine translation [10,11]. E2E ASR models based on connectionist temporal classification (CTC) [12] and Transducer [13][14][15][16] have also been proposed.
As a result of recent advances in ASR technology and increasing ease of use, we have seen greater use of ASR models in various commercial applications. For example, ASR models are now used in AI speakers and in speech assistants such as Alexa [17] and Siri [18], which has made ASR technology more and more familiar to the public. However, training such ASR models requires a large amount of speech and transcription data. We have found, in our own previous research, that creating a dataset for a target domain (e.g., the medical domain) for an ASR model is expensive in terms of both time and money. Therefore, in Japan, we often train ASR models for commercial use with publicly available datasets such as the Corpus of Spontaneous Japanese (CSJ) [19] or the LaboroTVSpeech corpus [20] or use publicly available, large-scale ASR models pre-trained in generic domains. However, general domain ASR models and ASR models trained with publicly available datasets may not perform as required in a target domain environment. In fact, we have had great difficulty creating an ASR model for a specialized domain.
Against this background, a method to adapt existing large-scale ASR models to a target domain would be very useful. Currently, fine-tuning is the most popular and effective domain adaptation method for ASR tasks [21]. This involves re-training a large-scale, out-of-domain ASR model with a small amount of target domain speech and transcription data, in order to create a target domain-adapted ASR model. Many studies have proposed efficient fine-tuning methods which use limited computing resources, such as Adapters [22]. To fine-tune an ASR model, it is generally necessary to prepare several hours of target domain training data, which includes speech and its transcription; thus, there is still the problem of the cost of preparing target domain training data.
Another effective method for domain adaptation of ASR models is to use ASR models in combination with external language models [23][24][25][26], the most common method of which is Shallow Fusion [27,28]. Other methods for combining ASR models with external language models have also been proposed, including Cold Fusion [29] using gate mechanisms [30], Component Fusion [31], and Deep Fusion [11]. All of these language model integration methods improve ASR performance; however, there are some drawbacks associated with each method. The Shallow Fusion method adds the output probability of the language model for the target domain to that of the existing ASR model, which is dependent on the language information contained in the training data used. This means that Shallow Fusion adds the output probabilities of two models trained with different language information. The Deep, Component, and Cold Fusion methods require retraining of the ASR model each time a new language model is integrated. Shallow Fusion, in contrast, applies the language model only during decoding and requires no model retraining, so these methods have not replaced it as the go-to method among most of the ASR community.
In the days when Gaussian mixture model-hidden Markov model (GMM-HMM) and deep neural network-hidden Markov model (DNN-HMM) ASR models were primarily being used, it was less difficult to change the domain of an ASR model because the acoustic model, dictionary, and language model could each be easily replaced with target domain versions. However, since E2E ASR models are simultaneously trained with acoustic and language information, it is very difficult to completely separate the acoustic and language information contained within an E2E ASR model.
Against this background, the goal of this study is domain adaptation by separating the acoustic and language information inside an E2E ASR model. Since these two types of information are learned jointly in a single neural network model, it is impossible to separate them strictly. In this paper, we propose to approximate this separation by subtracting the "implicit" language model probability in the log-likelihood domain, and we validate the approach on a Japanese ASR task. Several studies have been conducted in parallel using a very similar formulation, called the density ratio approach (DRA) [32], and have reported its effectiveness mainly for recurrent neural network (RNN)-Transducer models in English, Spanish, and Italian ASR tasks [15,32]. In contrast to these studies, we construct Japanese encoder-decoder ASR models, apply language model replacement to these models [33,34], and analyze the resulting behavior for the first time. We believe that integrating external language models with Japanese ASR models may be more difficult than doing so for ASR models of alphabetic languages. One reason for this is the size of the Japanese vocabulary. The Japanese vocabulary is huge, with thousands of unique tokens (characters) used in common speech, whereas alphabetic languages use only a small set of letters (26 in the case of English). Even when subwords are used, the number of tokens is at most one or two thousand. The grammar of such languages is also much stricter than that of Japanese, so prediction by a language model tends to be easier than for Japanese. Japanese is also known to have more homonyms and pronunciation variations than alphabetic languages. There are three types of characters in Japanese: kanji, hiragana, and katakana. In addition, Japanese has complex and difficult prefixes and suffixes. Furthermore, Japanese grammar imposes very loose constraints, especially in spontaneous speech; in other words, Japanese does not conform to the Western SVO (subject + verb + object) grammar. It has already been reported that the linguistic information inside English, Spanish, and Italian speech recognition models can be approximated by external language models. However, it is unclear whether the linguistic information learned by a character-based Japanese ASR model can be approximated by a language model.

Related work
In this section, we first explain the language models used in automatic speech recognition systems; then, we introduce some methods used to integrate these language models and ASR models.

Language models
The language models used in ASR systems are probabilistic models that assign probabilities to sequences of letters and words. In other words, they are models that predict the likelihood of the occurrence of particular words or characters by inferring which words or characters are more natural in a given context. Typical language models include N-grams, which are statistical language models that use the probabilities of word chains containing N words, and RNN language models, which use recurrent neural networks capable of learning the properties of time-series data. In our experiments, we used an RNN language model given by the probabilities $P(y_l \mid y_0, \ldots, y_{l-1})$.
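To make the conditional probabilities above concrete, the following is a minimal sketch of a character-level bigram model (an N-gram with N = 2), which is much simpler than the RNN language model used in our experiments. The function names, the add-one smoothing, and the start symbol `<s>` are assumptions made for this illustration only.

```python
import math
from collections import Counter

def train_bigram_lm(sentences, bos="<s>"):
    """Count character bigrams. Returns (bigram_counts, history_counts, vocab)."""
    bigrams, histories, vocab = Counter(), Counter(), set()
    for sent in sentences:
        prev = bos
        for ch in sent:
            bigrams[(prev, ch)] += 1
            histories[prev] += 1
            vocab.add(ch)
            prev = ch
    return bigrams, histories, vocab

def cond_prob(model, prev, ch):
    """P(ch | prev) with add-one smoothing (an assumption of this sketch)."""
    bigrams, histories, vocab = model
    return (bigrams[(prev, ch)] + 1) / (histories[prev] + len(vocab))

def sequence_log_prob(model, sent, bos="<s>"):
    """Chain rule: log P(y) = sum over l of log P(y_l | y_{l-1})."""
    prev, total = bos, 0.0
    for ch in sent:
        total += math.log(cond_prob(model, prev, ch))
        prev = ch
    return total

model = train_bigram_lm(["abab", "abba"])
print(sequence_log_prob(model, "ab"))  # higher (less negative) than for "bb"
print(sequence_log_prob(model, "bb"))
```

An RNN language model plays the same role but conditions on the full history rather than only on the previous character. No end-of-sentence symbol is modeled here.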

Shallow Fusion
Many studies have been conducted on methods of integrating ASR and language models, and the most common method currently used is called Shallow Fusion [28]. A conceptual diagram of Shallow Fusion is shown in Fig. 1. In Shallow Fusion, the output probabilities of an ASR model and a language model are added together in the logarithmic domain. When the ASR model output y represents a sequence $y_1, y_2, \ldots, y_L$, where $y_l$ is the symbol output at each time frame and L is the length of the output sequence, and when the input for the decoder (the encoder output in Fig. 1) is expressed as x, the resulting output sequence $\hat{y}$ is expressed as shown in Eq. (1):

$$\hat{y} = \operatorname*{argmax}_{y} \left\{ \log P_{\mathrm{ASR}}(y \mid x) + \lambda \log P_{\mathrm{LM}}(y) \right\}, \tag{1}$$

where $\log P_{\mathrm{ASR}}(y \mid x)$ is the output probability of the ASR model, which is the probability of inferring symbol label sequence y when the acoustic feature sequence x output from the encoder is given. The expression $\log P_{\mathrm{LM}}(y)$ in Fig. 1 represents the prior of y given by the language model, and $\lambda$ is a weighting parameter that balances the output probabilities of the ASR model and the language model to maximize prediction performance; it is determined through trials using data other than the test data. When using the Shallow Fusion method, the language model is only used during inference, and the language model and ASR model are trained independently.
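The Shallow Fusion score can be illustrated by rescoring a small list of candidate hypotheses. This is a minimal sketch: the hypothesis names and log-probabilities are invented for illustration, and the function names are hypothetical rather than part of any toolkit.

```python
def shallow_fusion_score(log_p_asr, log_p_lm, lm_weight):
    """Shallow Fusion: log P_ASR(y|x) + lm_weight * log P_LM(y)."""
    return log_p_asr + lm_weight * log_p_lm

def pick_best(hypotheses, lm_weight):
    """hypotheses: list of (text, log_p_asr, log_p_lm). Returns the best text."""
    best = max(hypotheses,
               key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight))
    return best[0]

# Invented scores: hyp_a is preferred acoustically, hyp_b by the language model.
hyps = [("hyp_a", -4.0, -9.0), ("hyp_b", -4.5, -6.0)]
print(pick_best(hyps, lm_weight=0.0))  # ASR score only
print(pick_best(hyps, lm_weight=0.3))  # ASR score plus weighted LM score
```

With a weight of 0 the ranking is determined by the ASR model alone; as the weight grows, the language model has more influence on the selected hypothesis.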

Deep Fusion
The Deep Fusion method [11] uses a pre-trained ASR model and a pre-trained language model, which are first trained independently. The ASR model and the language model are then integrated by training a DNN, into which the hidden state information of the two pre-trained models is fed.

Cold Fusion
A modified version of the Deep Fusion method, called Cold Fusion, has also been proposed [29]. In Cold Fusion, the ASR model is trained using linguistic information from a pre-trained language model. The method is expressed by Eq. (3), in which $l_t^{\mathrm{LM}}$ is the logit output of the language model and $s_t$ is the state of the ASR model. The gate value $g_t$ is computed from the state $h_t^{\mathrm{LM}}$ of the LM, the state $s_t$ of the ASR model, and the trained weight parameters W and b. The fused state $s_t^{\mathrm{CF}}$ is the concatenation of $s_t$ with the Hadamard product of $g_t$ and $h_t^{\mathrm{LM}}$. In this way, the state of the ASR model ($s_t$) and the DNN output of the language model ($l_t^{\mathrm{LM}}$) are combined to integrate their information. In addition, it has been reported that the performance of Cold Fusion is improved by using a fine-grained gating mechanism as the gate algorithm.
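The gating computation described above can be written compactly as follows. This is a sketch based on the formulation in the Cold Fusion paper [29]; the DNN mappings and the softmax output layer are taken from that formulation and should be read as assumptions relative to the description given here.

```latex
\begin{aligned}
h_t^{\mathrm{LM}} &= \mathrm{DNN}\bigl(l_t^{\mathrm{LM}}\bigr), \\
g_t &= \sigma\bigl(W\,[s_t;\, h_t^{\mathrm{LM}}] + b\bigr), \\
s_t^{\mathrm{CF}} &= \bigl[s_t;\; g_t \circ h_t^{\mathrm{LM}}\bigr], \\
r_t^{\mathrm{CF}} &= \mathrm{DNN}\bigl(s_t^{\mathrm{CF}}\bigr), \\
\hat{P}(y_t \mid x, y_{<t}) &= \mathrm{softmax}\bigl(r_t^{\mathrm{CF}}\bigr).
\end{aligned}
```

Here $[\cdot\,;\cdot]$ denotes concatenation, $\circ$ the Hadamard product, and $\sigma$ the logistic sigmoid. Because $g_t$ is a vector rather than a scalar, each dimension of the language model state is gated separately, which corresponds to the fine-grained gating mentioned above.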
As mentioned previously, several other methods for integrating the ASR model and the language model have also been proposed, and these integration methods have also been reported to improve speech recognition performance. However, some of these methods are not theoretically correct and some require additional DNN training using in-domain speech data. As will be explained later, our method does not require retraining of the ASR model with any additional speech data. Only the language models need to be trained. Of the integration methods discussed in this section, only Shallow Fusion shares this characteristic, while the others require retraining of the ASR model. Furthermore, the formulation of our proposed method is similar to that of Shallow Fusion but is theoretically different. To compare the performance of our proposed method with that of the Shallow Fusion method, we performed an experiment, which is described in Section 4 of this paper.

Adaptation of LM in end-to-end ASR model using language model replacement
We propose a method of adapting the language model in a conventional, pre-trained E2E ASR model in order to improve recognition of target domain speech. This is achieved by first estimating the "implicit language information" contained inside the pre-trained ASR model. During the inference stage, this estimated language information is used to eliminate the prior language information learned from the source domain data that was used to train the original ASR model. Then, language information from the target domain, obtained from an independently trained language model, is combined with the adapted ASR model using a method similar to Shallow Fusion. A diagram of the proposed method is shown in Fig. 2.

Fig. 2 Language model replacement

When a language model integration method is not used, the ASR model infers the sequence $\hat{y}$ as follows:

$$\hat{y} = \operatorname*{argmax}_{y} \left\{ \log P_{\mathrm{source}}(y \mid x) \right\}, \tag{4}$$

where $\log P_{\mathrm{source}}(y \mid x)$ is the log probability of output sequence y obtained from the source domain ASR model when the input sequence is x. Here, "source domain" refers to the task or activity during which the speech data used for training the ASR model was recorded, and x and y represent the input acoustic feature sequence and the output symbol label sequence, respectively. The log output probability $\log P_{\mathrm{source}}(y \mid x)$ from the ASR model can be expanded using Bayes' rule as follows:

$$\log P_{\mathrm{source}}(y \mid x) = \log P_{\mathrm{source}}(x \mid y) + \log P_{\mathrm{source}}(y) - \log P_{\mathrm{source}}(x). \tag{5}$$

On the right side of Eq. (5), we can see that the ASR model includes the acoustic information term $\log P_{\mathrm{source}}(x \mid y)$ and the language information term $\log P_{\mathrm{source}}(y)$, the latter of which is the "implicit" language information contained in the ASR model, as described above, i.e., the statistics of the language contained in the source domain speech data used for training the ASR model. Shallow Fusion methods do not take this "implicit language information" into account. However, ASR models do take advantage of this "implicit language information" to improve decoding when the domain of the test data is the same as the domain of the training data. But if the test domain is quite different from the training domain, this information can cause degradation of the ASR model's decoding performance.
Our method attempts to remove the "implicit language information" from the pre-trained ASR model using Bayes' rule. Assuming that the "implicit language information" from the source domain contained within the trained ASR model can be approximated by an external language model trained using text data from the same source domain, this source domain language information can be removed by subtracting the output probability of the external language model from the output probability of the ASR model for the source domain as follows:

$$\log P_{\mathrm{source}}(y \mid x) - \lambda_{\mathrm{sub}} \log \tilde{P}_{\mathrm{source}}(y) \propto \log P_{\mathrm{source}}(x \mid y) + \log P_{\mathrm{source}}(y) - \lambda_{\mathrm{sub}} \log \tilde{P}_{\mathrm{source}}(y) \approx \log P_{\mathrm{source}}(x \mid y), \tag{6}$$

where $\log \tilde{P}_{\mathrm{source}}(y)$ is the probability of the external language model for the source domain, and $\lambda_{\mathrm{sub}}$ is a subtraction weight for balancing the acoustic and language information, which compensates for the estimation error of $P_{\mathrm{source}}(y)$, that is, the difference between $P_{\mathrm{source}}(y)$ and $\tilde{P}_{\mathrm{source}}(y)$. The term $\log P_{\mathrm{source}}(x)$ in Eq. (5) does not depend on y and is therefore dropped, since it does not affect the argmax over y. Equation (6) can also be thought of as an estimation of the log output probability of a purely acoustic model.

The language information can then be replaced by adding the log output probabilities of the external language model trained for the target domain to Eq. (6), as follows:

$$\log P_{\mathrm{source}}(y \mid x) - \lambda_{\mathrm{sub}} \log \tilde{P}_{\mathrm{source}}(y) + \lambda_{\mathrm{add}} \log \tilde{P}_{\mathrm{target}}(y) \approx \log P_{\mathrm{source}}(x \mid y) + \lambda_{\mathrm{add}} \log \tilde{P}_{\mathrm{target}}(y) \propto \log P_{(\mathrm{source},\mathrm{target})}(y \mid x), \tag{7}$$

where $\log \tilde{P}_{\mathrm{target}}(y)$ is the probability of the external language model for the target domain, and $\lambda_{\mathrm{add}}$ is an addition weight. Here, $P_{(\mathrm{source},\mathrm{target})}(y \mid x)$ represents a model with acoustic information from the source domain and language information from the target domain. To maximize recognition performance, $\lambda_{\mathrm{sub}}$ and $\lambda_{\mathrm{add}}$ are estimated using a grid search on a target domain dataset which is different from the test data. When the estimation of the implicit language information is accurate, the value of $\lambda_{\mathrm{sub}}$ is expected to be around 1.0. By using Eqs. (4) to (7), our method replaces only the source domain language information within the ASR model. Therefore, an ASR model can be created for any target domain simply by preparing text data for that domain. Furthermore, this method integrates the external language model and ASR model at the inference stage as in Shallow Fusion, so only the language models used for subtraction and addition need to be trained, and the ASR model does not need retraining. Unlike Shallow Fusion, in which only the external in-domain language model is added, our method subtracts the implicit source domain language information and then adds in-domain language information, based on Bayes' rule.
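The difference between Shallow Fusion and language model replacement can be illustrated at the level of whole hypotheses. The following is a minimal sketch: the hypothesis names, the log-probabilities, and the weights are invented for illustration, and `lmr_score` is a hypothetical helper rather than a function of any toolkit.

```python
def lmr_score(log_p_asr, log_p_src_lm, log_p_tgt_lm, w_sub, w_add):
    """Language model replacement score:
    log P_source(y|x) - w_sub * log P~_source(y) + w_add * log P~_target(y).
    Setting w_sub = 0 reduces this to Shallow Fusion."""
    return log_p_asr - w_sub * log_p_src_lm + w_add * log_p_tgt_lm

def pick_best(hypotheses, w_sub, w_add):
    """hypotheses: list of (text, log_p_asr, log_p_src_lm, log_p_tgt_lm)."""
    return max(hypotheses, key=lambda h: lmr_score(h[1], h[2], h[3], w_sub, w_add))[0]

# Invented scores: the first hypothesis is favored by the source-domain
# statistics inside the ASR model, the second by the target-domain LM.
hyps = [
    ("source_style", -3.0, -2.0, -8.0),
    ("target_style", -4.0, -7.0, -3.0),
]
print(pick_best(hyps, w_sub=0.0, w_add=0.0))  # baseline, no LM integration
print(pick_best(hyps, w_sub=0.0, w_add=0.1))  # Shallow Fusion
print(pick_best(hyps, w_sub=0.3, w_add=0.1))  # language model replacement
```

In this toy case, adding the target domain language model alone does not change the decision, while also subtracting the source domain language model does. This is only meant to show the mechanics of Eq. (7), not to suggest typical score magnitudes.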
As described in Section 1, several previous studies have proposed a similar formulation in which an implicit source domain language model and a target domain language model are combined with the ASR model using weights. For example, McDermott et al. conducted experiments using an RNN-T decoder as an ASR model [32] and demonstrated that their approach is effective for English, Spanish, and Italian.
In this paper, we port this methodology into encoder-decoder ASR models and perform experiments using Japanese-language ASR tasks, which are more difficult than alphabetic language tasks due to the language's huge written vocabulary.

Datasets used in experiments
This study required the use of datasets from multiple domains in order to validate the proposed language model replacement method for adapting ASR models to a target domain, which is achieved by replacing the "implicit language information" contained inside an existing ASR model with target domain information. We used three Japanese-language datasets in our experiments: the Corpus of Spontaneous Japanese (CSJ) [19], the Japanese Newspaper Article Speech (JNAS) corpus [35], and the Mainichi Shimbun (MS) newspaper articles text dataset [36]. Table 1 shows the details of the datasets used in our experiment. The CSJ APS corpus and CSJ SPS corpus were both randomly split into a training set, a "dev1" set, a "dev2" set, and a test set, at a ratio of 9.0:0.5:0.25:0.25, respectively. The training and "dev1" sets were used as training and development data, respectively, when training the ASR models, while the "dev2" set was used for tuning language model weights, and the test set was used for performance evaluation. The number of unique Japanese characters contained in the CSJ data used in this experiment was 3262. The MS dataset is a text dataset consisting of a total of 58,944,516 non-unique characters from daily newspaper articles and was only used to train the external "target domain" language model. The JNAS dataset was created by having speakers read aloud excerpts from Mainichi Shimbun newspaper articles; thus, the JNAS speech data is from the same domain as the MS text data. The JNAS dataset was split into a "dev" set and a test set. The "dev" set was used as data for language model tuning and the test set as evaluation data for our second experiment.
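A random split at the ratio 9.0:0.5:0.25:0.25 could be produced along the following lines. This is a minimal sketch: the utterance IDs, the fixed seed, and the function name are assumptions for illustration, and the splitting unit actually used (utterance, lecture, or speaker) is not specified here.

```python
import random

def split_corpus(utt_ids, ratios=(9.0, 0.5, 0.25, 0.25), seed=0):
    """Shuffle IDs and cut them into train/dev1/dev2/test at the given ratio."""
    ids = list(utt_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    total = sum(ratios)
    names = ("train", "dev1", "dev2", "test")
    splits, start, acc = {}, 0, 0.0
    for name, ratio in zip(names, ratios):
        acc += ratio
        end = round(len(ids) * acc / total)  # cumulative boundary for this subset
        splits[name] = ids[start:end]
        start = end
    return splits

splits = split_corpus([f"utt{i:04d}" for i in range(400)])
print({name: len(ids) for name, ids in splits.items()})
# {'train': 360, 'dev1': 20, 'dev2': 10, 'test': 10}
```

Using cumulative boundaries rather than rounding each subset size separately guarantees that every ID lands in exactly one subset.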

ESPnet
The End-to-End Speech Processing Toolkit, ESPnet, is an open-source speech processing toolkit specialized for end-to-end models [37,38], which contains several types of ASR models. We used the RNN (Hybrid CTC/Attention Architecture) and Transformer (Joint CTC Attention Transformer) models in our experiments.

Hybrid CTC/Attention Architecture
Figure 3 shows a diagram of a Hybrid CTC/Attention Architecture ASR model. First, the input acoustic features are processed using VGGnet [39]; then, they are converted into intermediate representation H by six BLSTM (Bidirectional LSTM) layers which are used as the encoder. The decoder consists of one LSTM layer and one Linear layer. An additional Linear layer is used as the CTC decoder.

Joint CTC Attention Transformer
Figure 4 shows a diagram of a Joint CTC Attention Transformer ASR model. The encoder consists of a stack of N = 18 identical layers. Each layer has two sub-layers, one of which is a multi-head self-attention mechanism, while the other is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. Thus, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
The decoder consists of a stack of Q = 6 identical layers. In addition to the two sub-layers used in each encoder layer, each decoder layer contains a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, a residual connection is used around each sub-layer, followed by layer normalization. The self-attention sub-layer of the decoder stack is also masked so that positions cannot attend to subsequent positions. Further, in order to supplement the input acoustic information, a CTC branch composed of one Linear layer is used on top of the encoder.
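Two of the mechanisms described above, the residual connection followed by layer normalization and the masking of subsequent positions in the decoder, can be sketched in a few lines. This is a minimal illustration on plain Python lists: the learnable gain and bias of layer normalization are omitted, and the function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no gain/bias here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(x, sublayer):
    """Residual connection followed by normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

out = residual_sublayer([1.0, 2.0, 3.0], lambda v: [2.0 * a for a in v])
print(out)
for row in causal_mask(3):
    print(row)
```

The lower-triangular mask is what lets the decoder be trained to predict the current symbol from past symbols only, which is why the decoder absorbs language information during training.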

Implementation of language model replacement to ESPnet models
Figure 5 shows a diagram of the proposed language model replacement (LMR) method when applied to an ESPnet2 encoder-decoder ASR model using Transformer as the encoder-decoder, where language model replacement is applied to the decoder (see footnote 1). The decoder of the RNN model is constructed using an RNN layer such as an LSTM, which attempts to predict the current state of an utterance from its past state. The decoder of the Transformer ASR model is composed of attention layers and feed-forward layers, but it performs masking and is also trained to predict the current state from the past state, i.e., the decoder has been trained to include linguistic information. We first subtract the output probability of the language model trained in the source domain from the output probability of the decoder of the encoder-decoder ASR model trained in the source domain. The subtraction of this language model is intended to remove the language information contained in the decoder. Then, the language information can be tuned to the target domain by adding the output probabilities of the language model for the target domain. This method is also effective for ASR models which use a beam search, such as the Conformer model.
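During beam search, the subtraction and addition are applied to the per-token scores at each decoding step. The following is a minimal sketch of one such step under simplifying assumptions: the three models are represented by plain dictionaries of token log-probabilities, CTC scores are ignored, and the function names are hypothetical and do not correspond to the ESPnet API.

```python
import heapq
import math

def lmr_token_scores(dec_logp, src_lm_logp, tgt_lm_logp, w_sub, w_add):
    """Combine next-token log-probabilities from the decoder and the two LMs."""
    return {
        tok: dec_logp[tok] - w_sub * src_lm_logp[tok] + w_add * tgt_lm_logp[tok]
        for tok in dec_logp
    }

def expand_beam(prefix, prefix_score, token_scores, beam_size):
    """Extend one partial hypothesis with its beam_size best next tokens."""
    best = heapq.nlargest(beam_size, token_scores.items(), key=lambda kv: kv[1])
    return [(prefix + [tok], prefix_score + score) for tok, score in best]

# Invented next-token distributions over a three-symbol vocabulary.
dec = {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}
src = {"a": math.log(0.7), "b": math.log(0.2), "c": math.log(0.1)}
tgt = {"a": math.log(0.1), "b": math.log(0.8), "c": math.log(0.1)}

scores = lmr_token_scores(dec, src, tgt, w_sub=1.0, w_add=1.0)
for hyp, score in expand_beam([], 0.0, scores, beam_size=2):
    print(hyp, round(score, 3))
```

In this toy step, token "a" is the decoder's first choice mainly because the source domain favors it; after subtraction and addition, "b" is ranked first instead.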

ASR and language models
In this section, we describe the six ASR models and four language models used in our experiments.
• ASR models. A Hybrid CTC/Attention Architecture ASR model and a Joint CTC Attention Transformer ASR model were each trained using either the APS corpus, the SPS corpus, or both the APS and SPS corpora. We used ESPnet's Hybrid CTC/Attention Architecture and Joint CTC Attention Transformer models, as described in Section 4.2.1. All of the models were set up using ESPnet's CSJ recipe.
• Language models. Four LSTM language models were trained using either APS text data, SPS text data, APS and SPS text data, or Mainichi Shimbun text data. We used the LSTM language model provided by ESPnet, which consists of an Embedding layer, two LSTM layers, and a Linear layer, for all four of the language models.

Footnote 1: Encoder-decoder models from ESPnet2 use a CTC decoder to align output in monotonic order. Strictly speaking, the CTC decoder output must also be compensated for during language model replacement. Theoretically, the implicit language information in the CTC decoder can be simulated using a unigram model [40]; however, context-dependency has been observed, so language model replacement is difficult. This remains a task for future work.

Differences among language domains
We calculated test set perplexities for the language models to analyze the differences among the language domains of the datasets, using four RNN language models. The perplexity P of a probability distribution $p_{\mathrm{LM}}$, describing a language model which outputs the probability of $y_l$ when given the history $y_1, \ldots, y_{l-1}$, is shown in Eq. (8):

$$P = \left( \prod_{l=1}^{L} p_{\mathrm{LM}}(y_l \mid y_1, \ldots, y_{l-1}) \right)^{-1/L}, \tag{8}$$

where $y_l$ indicates the l-th character in the test set reference transcription and where the sequence $(y_1, \ldots, y_L)$ is a transcription of the test set. Table 2 shows the perplexities of the test sets for each model.
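Perplexity can be computed from the per-character conditional probabilities that a language model assigns to a reference transcription. The following is a minimal sketch in which the probabilities are supplied directly as a list; the function name and the example values are assumptions for illustration.

```python
import math

def perplexity(cond_probs):
    """Perplexity = exp(-(1/L) * sum of log p(y_l | history))."""
    length = len(cond_probs)
    return math.exp(-sum(math.log(p) for p in cond_probs) / length)

# A model that assigns probability 1/4 to every character has perplexity 4,
# i.e., it is as uncertain as a uniform choice among 4 characters.
print(perplexity([0.25] * 8))

# Mixed probabilities: confident on some characters, less so on others.
print(perplexity([0.5, 0.1, 0.9, 0.05]))
```

Read this way, a perplexity of about 17 means that the model is, on average, as uncertain as a uniform choice among about 17 characters at each position.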
When the APS language model is evaluated with the APS test set, the perplexity is 17.33. On the other hand, when the APS language model is evaluated with the SPS test set, the perplexity is 33.37. Almost the same result was obtained when evaluating the SPS language model with the SPS and APS test sets. This indicates that the data in the APS and SPS datasets are from linguistically different domains (academic and non-academic presentation speech, respectively).
When evaluating the language model trained using APS+SPS with the APS+SPS test set and the JNAS test set, we obtained perplexities of 16.28 and 80.53, respectively, demonstrating that the difference between the domains of the APS+SPS dataset and the JNAS dataset is very large. This is because the APS and SPS are corpora of spontaneous speech, whereas the JNAS dataset is a read speech corpus of newspaper articles. As a result, the language model trained with Mainichi Shimbun data obtained a much lower perplexity when processing the JNAS test set.

Experiment 1: ASR tasks involving different language domains
We compared ASR performance when the source and target domain language models were not integrated (baseline method), when a conventional Shallow Fusion language model adaptation method was used, and when our proposed language model replacement (LMR) method was used, using each of the six encoder-decoder E2E ASR models and the four language models described in Section 4.3. We evaluated each model's performance using its character error rate (CER) when processing the test set. Our ASR system uses Japanese characters as units for recognition. We counted the number of substitutions (S), deletions (D), insertions (I), and correct characters (C) when calculating the CER. When N represents the number of characters in the reference, so that N = S + D + C, the CER is calculated as follows:

$$\mathrm{CER} = \frac{S + D + I}{N} \times 100\ (\%).$$

The APS and SPS test sets were used for cross-domain evaluation: the APS-trained models were evaluated on the SPS test set, and the SPS-trained models on the APS test set. We used the dev2 sets to tune the addition weight $\lambda_{\mathrm{add}}$ of the Shallow Fusion equation and the subtraction and addition weights $\lambda_{\mathrm{sub}}$ and $\lambda_{\mathrm{add}}$ of the LMR equation. The language model weights were optimized using a grid search in the range of 0.1 to 1.1 in increments of 0.2. Experimental results for the Hybrid CTC/Attention Architecture and Joint CTC Attention Transformer ASR models are shown in Tables 3 and 4, respectively. For reference, we also show the CERs of the APS ASR model when using the APS test set, and the SPS ASR model when using the SPS test set, i.e., the results under matched domain conditions.
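The counts S, D, I, and C can be obtained from a minimum-edit-distance alignment between the reference and the recognized character sequences. The following is a minimal sketch of such a computation; it is not the scoring tool used in our experiments, the function names are hypothetical, and it assumes a non-empty reference.

```python
def edit_counts(ref, hyp):
    """Return (S, D, I, C) for one minimum-edit alignment of hyp against ref."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (edits, S, D, I, C) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0, 0)   # only deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j, 0)   # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins, c = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, ins, c + 1)       # correct character
            else:
                diag = (e + 1, s + 1, d, ins, c)   # substitution
            e, s, d, ins, c = dp[i - 1][j]
            deletion = (e + 1, s, d + 1, ins, c)
            e, s, d, ins, c = dp[i][j - 1]
            insertion = (e + 1, s, d, ins + 1, c)
            dp[i][j] = min(diag, deletion, insertion, key=lambda t: t[0])
    return dp[n][m][1:]

def cer(ref, hyp):
    """Character error rate in percent: 100 * (S + D + I) / N, N = S + D + C."""
    s, d, i, c = edit_counts(ref, hyp)
    return 100.0 * (s + d + i) / (s + d + c)

print(cer("abcd", "abxd"))  # one substitution in four characters
print(cer("abcd", "abd"))   # one deletion in four characters
```

Because insertions are counted in the numerator but not in N, the CER can exceed 100% when the hypothesis is much longer than the reference.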
As shown in Table 3, when we used the APS-trained Hybrid CTC/Attention Architecture ASR model with the SPS test set, our proposed LMR method achieved a CER of 13.4%, outperforming the Shallow Fusion integration method, which achieved a CER of 14.9%. When we used the SPS-trained Hybrid CTC/Attention Architecture ASR model with the APS test set, the LMR method achieved a CER of 16.7%, while the Shallow Fusion method achieved a CER of 18.5%, a relative reduction in error of 9.7%. As shown in Table 4, when using the APS-trained Joint CTC Attention Transformer ASR model and the SPS test set, the LMR integration method obtained a CER of 9.8%, better performance than the Shallow Fusion integration method, which achieved a CER of 10.8%. When using the SPS-trained Joint CTC Attention Transformer ASR model with the APS test set, the LMR integration method achieved a CER of 12.7%, while the CER when using Shallow Fusion was 14.3%. Thus, our proposed LMR integration method achieved better recognition results than Shallow Fusion in encoder-decoder E2E ASR models when performing cross-domain recognition tasks in Japanese.
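The relative error reductions implied by these CERs can be checked with a short calculation. This is a minimal sketch; the function name is hypothetical, and the input values are the Shallow Fusion and LMR CERs quoted above.

```python
def relative_reduction(baseline_cer, new_cer):
    """Relative error reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer

# (Shallow Fusion CER, LMR CER) pairs quoted in the text above.
pairs = {
    "Hybrid, APS model on SPS test": (14.9, 13.4),
    "Hybrid, SPS model on APS test": (18.5, 16.7),
    "Transformer, APS model on SPS test": (10.8, 9.8),
    "Transformer, SPS model on APS test": (14.3, 12.7),
}
for name, (sf, lmr) in pairs.items():
    print(f"{name}: {relative_reduction(sf, lmr):.1f}%")
```

For the SPS-trained Hybrid model on the APS test set this gives 9.7%, matching the figure stated above; the other three conditions give relative reductions of roughly 9% to 11%.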
We then optimized the language model weights using the dev2 datasets. CERs when using various weights are shown in Tables 5 and 6. The horizontal axis values represent language model subtraction weights, while the vertical axis values represent addition weights. The column for subtraction weight $\lambda_{\mathrm{sub}} = 0$ represents results when using the Shallow Fusion integration method. Overall, we can see that the optimal addition weights for LMR are larger than the optimal addition weights for Shallow Fusion. This suggests that LMR integration allows the ASR model to use language information from the target domain more effectively than Shallow Fusion. In each experiment, the optimal subtraction and addition weights for the dev2 set are almost the same as those for the test set when either LMR or Shallow Fusion was applied; thus, optimization of these weights is stable. In general, when using Shallow Fusion, CER increases after the language model addition weight exceeds a certain value, while for LMR, this increase is observed for both the addition and subtraction weights. In other words, providing either excessive or insufficient linguistic information when using LMR leads to a decrease in ASR performance.
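The weight search over the range 0.1 to 1.1 in steps of 0.2 can be sketched as an exhaustive search over all weight pairs. In the following minimal sketch, `dev_cer` is a hypothetical callback standing in for decoding the development set with given weights and scoring the result; here it is replaced by an invented bowl-shaped function so that the example runs on its own.

```python
from itertools import product

def weight_grid(start=0.1, stop=1.1, step=0.2):
    """Evenly spaced weights; rounding avoids accumulated float error."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 10) for k in range(n)]

def grid_search(dev_cer, sub_grid, add_grid):
    """Return the (w_sub, w_add) pair with the lowest development-set CER."""
    return min(product(sub_grid, add_grid), key=lambda w: dev_cer(*w))

# Stand-in for real decoding: an invented CER surface with its minimum
# at w_sub = 0.7, w_add = 0.5. A real dev_cer would run the recognizer.
def toy_dev_cer(w_sub, w_add):
    return 12.0 + (w_sub - 0.7) ** 2 + (w_add - 0.5) ** 2

grid = weight_grid()
print(grid)                                    # [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
print(grid_search(toy_dev_cer, grid, grid))    # LMR: search both weights
print(grid_search(toy_dev_cer, [0.0], grid))   # Shallow Fusion: w_sub fixed at 0
```

Fixing the subtraction weight at 0 restricts the search to Shallow Fusion, so both methods can be tuned with the same procedure and the same development data.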

Experiment 2: ASR tasks involving domains with different speaking styles
We also investigated ASR performance when integrating source and target domains with different speaking styles. The ASR model trained with the APS+SPS dataset, which combines data from two spontaneous speech corpora, was adapted to the JNAS read newspaper article speech domain using the language model trained with the MS dataset. The weights for the language model were tuned using the JNAS dev set and optimized using a grid search with a range of 0.1 to 1.1 and a step size of 0.2. This range is used for the subtraction weight because, if the estimation of the implicit language information is accurate, the optimal value is expected to be around 1.0, whereas if the estimation is inaccurate, the subtracted information may act as noise and the optimal value becomes low. As for the addition weight, the optimal value depends on the degree of correlation with the test data. Thus, if the language model information matches the test data, the optimal value is expected to be around 1.0, and if it is not well matched, the optimal value will be low. ASR performance was evaluated using CER when processing the JNAS test set. Our experimental results are shown in Tables 7 and 8.
CERs when using the LMR and Shallow Fusion adaptation methods with the Hybrid CTC/Attention Architecture ASR model were 15.9% and 16.6%, respectively, a relative error reduction of 4.2% for the LMR method. When using the Joint CTC Attention Transformer ASR model, the LMR integration method achieved a reduction of 0.5% in absolute CER compared to the Shallow Fusion method. Thus, we were able to confirm that the proposed LMR adaptation method was more effective for ASR tasks involving domains with different speaking styles, even when acoustic adaptation was not applied. ASR model performance for experiment 2 when using various language model addition and subtraction weights is shown in Tables 9 and 10. Compared to the result of domain adaptation experiment 1, which used only the APS and SPS corpora, the optimal values for the subtraction and addition weights were smaller in experiment 2. As shown in Table 2, we observed that the JNAS test set perplexity for the MS language model is 33.43, while the matched perplexities for the APS and SPS language models are both about 17; therefore, the linguistic constraints of the MS language model on the JNAS corpus are relatively weak, which might be the reason for the smaller optimal language model weights in experiment 2.

Conclusion
In this paper, we have proposed a method for replacing the "implicit" source domain language information contained within an ASR model with language information from a target domain language model, in order to efficiently adapt pre-trained E2E ASR models to a target domain. This method is based on Bayes' rule and does not require re-training of the ASR model for adaptation.
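The replacement summarized above amounts to a subtraction and an addition of language model scores in the logarithmic domain. The following is a minimal sketch under stated assumptions: the function and field names (`lmr_score`, `best_hypothesis`, `asr`, `src_lm`, `tgt_lm`) and the example scores are hypothetical, and the combination is applied to whole-hypothesis log-probabilities for simplicity, which may differ from how the scores are combined during decoding in the experiments.

```python
from typing import Dict, List, Union

Hypothesis = Dict[str, Union[str, float]]


def lmr_score(
    log_p_asr: float,
    log_p_src_lm: float,
    log_p_tgt_lm: float,
    sub_weight: float,
    add_weight: float,
) -> float:
    """Combine scores in the log domain.

    Subtracts the weighted source-domain LM score (an estimate of the
    implicit language information) and adds the weighted target-domain LM
    score. With sub_weight = 0.0 this reduces to Shallow Fusion.
    """
    return log_p_asr - sub_weight * log_p_src_lm + add_weight * log_p_tgt_lm


def best_hypothesis(hyps: List[Hypothesis], sub_weight: float, add_weight: float) -> str:
    """Return the text of the hypothesis with the highest combined score."""
    best = max(
        hyps,
        key=lambda h: lmr_score(h["asr"], h["src_lm"], h["tgt_lm"], sub_weight, add_weight),
    )
    return best["text"]


# Made-up log-probabilities for two competing hypotheses (illustration only).
hyps = [
    {"text": "hyp_a", "asr": -4.0, "src_lm": -2.0, "tgt_lm": -6.0},
    {"text": "hyp_b", "asr": -4.5, "src_lm": -5.0, "tgt_lm": -3.0},
]

no_lm_choice = best_hypothesis(hyps, sub_weight=0.0, add_weight=0.0)
lmr_choice = best_hypothesis(hyps, sub_weight=0.5, add_weight=0.5)
```

In this toy case the ASR score alone prefers `hyp_a`, which the source-domain language model favors, while the replaced score prefers `hyp_b`, which the target-domain language model favors.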
Our first language model adaptation experiment, conducted using encoder-decoder models trained with the APS and SPS corpora of the Corpus of Spontaneous Japanese, showed that the proposed "language model replacement" (LMR) method achieved better ASR performance than the conventional Shallow Fusion method when integrating language models for different domains.
In a second experiment, we integrated language models trained using corpora with different speaking styles: a language model trained with spontaneous Japanese presentation speech and a language model trained with read speech of Japanese newspaper articles.
Our proposed LMR method also outperformed Shallow Fusion in this experiment. Finally, our analysis of the magnitude of the language model weights used to add linguistic information suggested that the proposed "language model replacement" method made better use of the target domain language information than the Shallow Fusion integration method, as reflected in ASR performance measured by CER.

Table 1
Details of datasets used in experiment

Table 2
Perplexities for each language model and test set

Table 3
Speech recognition results (CER (%)) for Hybrid CTC/Attention Architecture model for baseline (B/L), Shallow Fusion (SF), and language model replacement (LMR) methods for ASR tasks involving different language domains

Table 4
Speech recognition results (CER (%)) for Joint CTC Attention Transformer model for baseline (B/L), Shallow Fusion (SF), and language model replacement (LMR) methods for ASR tasks involving different language domains

Table 5
Hybrid CTC/Attention Architecture model results (CER) for ASR tasks involving different language domains when using various language model addition and subtraction weights, for Shallow Fusion (sub = 0.0) and for the language model replacement method

Table 6
Joint CTC Attention Transformer model results (CER) for ASR tasks involving different language domains when using various language model addition and subtraction weights, for Shallow Fusion (sub = 0.0) and for the language model replacement method

Table 7
Hybrid CTC/Attention Architecture model results (CER) for baseline (B/L), Shallow Fusion (SF), and language model replacement (LMR) methods for language models with different speaking styles

Table 8
Joint CTC Attention Transformer model results (CER) for baseline (B/L), Shallow Fusion (SF), and language model replacement (LMR) methods for language models with different speaking styles

Table 9
Hybrid CTC/Attention Architecture model results (CER) for ASR tasks involving different speaking styles when using various language model addition and subtraction weights, for Shallow Fusion (sub = 0.0) and for the language model replacement method

Table 10
Joint CTC Attention Transformer model results (CER) for ASR tasks involving different speaking styles when using various language model addition and subtraction weights, for Shallow Fusion (sub = 0.0) and for the language model replacement method