Semantic Features Based N-Best Rescoring Methods for Automatic Speech Recognition

Abstract: In this work, we aim to re-rank the n-best hypotheses of an automatic speech recognition system by penalizing sentences that contain words semantically different from the context and rewarding sentences in which all words are in semantic harmony. To achieve this, we propose a topic similarity score that measures the difference between the topic distribution of each word and that of the corresponding sentence. We also propose a word-discourse score that quantifies the likelihood of a word appearing in the sentence by the inner product of the word vector and the discourse vector. In addition, we use the latent semantic marginal and a variation of the log bi-linear model to obtain sentence coordination scores. We further introduce a fallibility weight, which assists the computation of the sentence semantic coordination score by instructing the model to pay more attention to the words that appear less often in the hypotheses list, and we show how to use the scores and the fallibility weight in hypotheses rescoring. None of the rescoring methods needs parameters other than the semantic models. Experiments conducted on the Wall Street Journal corpus show that, by using the proposed word-discourse score on 50-dimension word embeddings, we can achieve 0.29% and 0.51% absolute word error rate (WER) reductions on the two testsets.


Introduction
Language modeling, which learns the probability distribution of words and captures the extent to which a sequence of words can be considered well formed, is an important part of automatic speech recognition (ASR). It provides context to distinguish between words and phrases that are pronounced similarly. N-gram language models assume that the probability of the i-th word $w_i$ given the preceding $i-1$ words can be approximated by its probability given only the preceding $n-1$ words. Because only a small number of history words are needed, n-gram language models are generally used in the first-pass decoding of ASR. Along with the 1-best hypotheses, which could be the final outputs of the system, the first-pass decoding can generate lattices and n-best hypotheses. Other language models, such as neural network language models, are usually used to rescore the lattices or n-best hypotheses [1,2].
Here is an example of some of the hypotheses for the utterance "446c040q" of the Wall Street Journal (WSJ) dataset. Even without listening to the original speech, people who speak English and have common knowledge can easily point out that the second hypothesis is the most plausible one, because the word "surge" fits the topic of the sentence better and the sentence makes sense only with "surge".
Inspired by this, we propose a score that uses semantic information to measure whether a word is out of place in its context, and we use this score to rescore the hypotheses and improve the performance of ASR systems.
Most work that uses semantic features is based on neural network language models. Neural network language models often give excellent results, but training them demands substantial hardware and time, especially when the training set or the vocabulary is very large. In addition, neural language models lack flexibility: a small modification of the vocabulary list could require retraining the whole network. Thus, in this work, we offer a different option that uses semantic features without neural networks.
Our task has some similarity with ASR error detection, which focuses on distinguishing whether the words in ASR results are wrong by using both decoder-based and non-decoder-based features. Such methods are usually evaluated by the classification error rate of classifying words into right and wrong classes. Generally speaking, the 1-best hypothesis (the hypothesis with the highest score for the corresponding utterance) is not always the hypothesis with the lowest WER. Therefore, effective hypothesis selection can be of great help. Thus, in this work we do not focus on deciding whether a certain word is right or wrong, but propose scores that evaluate, through semantic features, to what degree a word is appropriate in the hypothesis sentence. We use the latent semantic marginal (LSM) [3] and a variation of the log bi-linear language model [4] to form sentence coordination scores. In addition, we propose a word-sentence topic similarity score and a word-discourse probability score based on the text generation model of Arora et al. [5]. All four scores are designed to be light-weight, i.e., they need no parameters other than the topic models or word embedding models. We use the four scores in the n-best rescoring process and show the effectiveness of each score in improving ASR systems.
As shown by the previous example, the hypotheses in an n-best list share many words. Moreover, if a word appears at similar positions in all the hypotheses, whether this word is correct does not change the result of the rescoring process. We should therefore focus more on the words that have more competing candidates. Thus, we first align the hypotheses to group the words in different hypotheses that are the recognized results for the same time period. Then we introduce a fallibility weight, which assists the computation of the sentence's semantic coordination scores by instructing the model to pay more attention to the words that have more possible choices.
The structure of this paper is as follows: In Section 2, we introduce some related works. In Section 3, we introduce the topic model and word embeddings which are the semantic features we use in this research. In Section 4, we present our rescoring method. We first introduce the four sentence scores. Then we describe the fallibility weight of words. Finally we introduce how we use the sentence scores in the ASR systems. We present experiments in Section 5 to validate our method and discuss some of the parameters in the method. Section 6 presents the conclusion.

Related Works
Topic information, as a kind of semantic information, has been used in language models in many works. Chu and Mangu performed a hard partitioning of the data according to topics and built a set of disjoint models, one for the data of each topic [6]. Jin et al. [7] mapped each paragraph into a unique vector in a continuous space, performed unsupervised clustering to construct topic clusters and then built several topic-specific language models as in [6]. Lau et al. [8] proposed a neural language model that incorporates document context in the form of a topic model-like architecture to provide a succinct representation of the broader document context outside of the current sentence. Latent Dirichlet allocation (LDA) [9] is the most popular topic modelling algorithm because it is less prone to overfitting and can be used to compute the probability of an unobserved document [3]. Mikolov et al. proposed using topic information from LDA as an external feature in recurrent neural network language models [10]. LDA has been used for language model adaptation in [3,11-13]. It has also been used to divide a training set into clusters by topic distribution in [14,15].
A hypotheses generator has been proposed to forecast air traffic control voice commands by taking context knowledge into account [16]. However, that work concentrates more on the semantic error rate than on the word error rate. The hypotheses generation for air traffic control can also be improved by machine learning [17].
Word embedding [18-20] projects each word to a vector in a continuous vector space, such that words with similar contexts are closer in the space. Many works have analyzed the resulting word vectors by experiments on polysemy or by studies of the similarity of nearby words or sentences [5,21,22]. Most works that use word embeddings in n-best rescoring also build on neural network language models. Audhkhasi et al. [23] used word embeddings as inputs in their proposed feedforward neural network language model architecture. In a Chinese speech recognition task, He and Xiang et al. [24] added character embeddings to the hidden layer and output layer in addition to adding word embeddings to the hidden layer.

Topic Model and Word Embeddings
In this section, we introduce the topic model and word semantic vector model that are used in the method we propose.

Latent Dirichlet Allocation for Topic Modelling
Latent Dirichlet Allocation (LDA) [9] is a text generation model with a hierarchical structure of document, topic and word. For a document, the generation process can be described as follows:

• Sample the length N of the document from a Poisson distribution: $N \sim \mathrm{Poisson}(\xi)$.
• Sample a multinomial distribution over topics for document i from a Dirichlet distribution parameterized by α: $\Theta_i \sim \mathrm{Dir}(\alpha)$.
• For the j-th word in the document, sample the topic of this word, $z_{i,j}$, from $\Theta_i$: $z_{i,j} \sim \mathrm{Multinomial}(\Theta_i)$, and then sample the word $w_{i,j}$ from the unigram distribution given the topic: $w_{i,j} \sim \mathrm{Multinomial}(\beta_{z_{i,j}})$.

The key parameter α controls the sharpness of the probability distribution over topics when choosing topics for a document. When α = 1, all topic distributions are equally likely. When α > 1, more uniform topic distributions are given larger probability. When α < 1, peaked topic distributions are favored. Our experiment uses the LDA implementation by Blei et al. (https://github.com/blei-lab/lda-c).
Given the training data, the number of topics and the initial α, the algorithm iteratively learns the best value of α, the per-topic word generation probabilities β and the topic distributions γ of the training documents. The LDA toolkit also includes an inference program that estimates the topic distribution γ of a given document from the trained model.
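The generation process described above can be illustrated with a minimal NumPy sketch. It is not part of the lda-c toolkit; the function name `generate_document`, the toy sizes and the layout of `beta` (one row per topic, each row a probability distribution over the vocabulary) are assumptions made only for illustration.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative story.

    alpha: Dirichlet parameter, shape (T,)
    beta:  per-topic word distributions, shape (T, V), rows sum to 1
           (assumed layout, chosen for this sketch)
    xi:    mean of the Poisson distribution over document length
    """
    n_topics, vocab_size = beta.shape
    length = rng.poisson(xi)                              # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                          # Theta_i ~ Dir(alpha)
    topics = rng.choice(n_topics, size=length, p=theta)   # z_ij ~ Multinomial(Theta_i)
    # w_ij is drawn from the word distribution of its topic
    words = np.array([rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int)
    return theta, topics, words

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(6), size=3)   # 3 toy topics over a 6-word vocabulary
# alpha < 1 tends to give a peaked topic distribution for the document
theta, topics, words = generate_document(np.full(3, 0.1), beta, xi=8, rng=rng)
```

Raising the entries of `alpha` above 1 in this sketch makes the sampled `theta` vectors more uniform, which mirrors the role of α discussed above.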

Word Embedding
Word embeddings are distributed representations of words in a vector space, which help learning algorithms achieve better performance in natural language processing tasks by grouping similar words. Many different approaches have been investigated to generate word embeddings from training text. In this paper, we use the widely used GloVe algorithm [20]. GloVe is based on the co-occurrences of words in a window. Let X be the word-word co-occurrence matrix, of which the entry $x_{i,j}$ denotes the number of times the i-th and j-th words co-occur in a window. Let $x_i = \sum_j x_{i,j}$ denote the number of times any word appears in the context of the i-th word.
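As a small illustration of the quantities $x_{i,j}$ and $x_i$ defined above, the following sketch counts co-occurrences in a symmetric window. It uses plain unweighted counts, which is a simplifying assumption of this sketch; the function name `cooccurrence_counts` and the toy sentences are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count symmetric word-word co-occurrences within `window` positions."""
    pair_counts = Counter()            # x_{i,j}
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    pair_counts[(word, tokens[ctx_pos])] += 1
    row_totals = Counter()             # x_i = sum_j x_{i,j}
    for (word, _), count in pair_counts.items():
        row_totals[word] += count
    return pair_counts, row_totals

# Toy sentences invented for this example
sentences = [["stock", "prices", "surged", "today"],
             ["stock", "prices", "fell"]]
pairs, totals = cooccurrence_counts(sentences, window=2)
```

Because the window is symmetric, `pairs[(a, b)]` equals `pairs[(b, a)]`, which corresponds to the symmetry of X.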
GloVe can be regarded as a weighted least squares regression with the following cost function:

$$J = \sum_{i,j=1}^{V} f(x_{i,j}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log x_{i,j} \right)^2,$$

where V is the size of the vocabulary, $w \in \mathbb{R}^d$ and $\tilde{w} \in \mathbb{R}^d$ are word vectors, and $b$ and $\tilde{b}$ are biases. As X is symmetric, $W$ and $\tilde{W}$ are equivalent and differ only as a result of their random initializations. The sum $W + \tilde{W}$ is the final output of the algorithm. The weighting function $f(x)$ should have the following properties [20]:

1. $f(0) = 0$, so that word pairs that never co-occur do not contribute to the cost.
2. $f(x)$ should be non-decreasing, so that rare co-occurrences are not overweighted.
3. $f(x)$ should be relatively small for large values of x, so that frequent co-occurrences are not overweighted.
In practice, $f(x)$ is usually chosen as follows:

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\ 1 & \text{otherwise}, \end{cases}$$

where $x_{\max}$ is a cutoff count and the exponent α here is a constant of the weighting function, distinct from the Dirichlet parameter of LDA.

Figure 1 shows the model architecture of GloVe. The input is a one-hot representation of a word. The word embedding matrices serve as weight matrices in the model, and thus the output of the model is a vector of inner products of word vectors. The embedding matrices are updated by the gradient of the cost function introduced earlier in this section.

The text generation process is modelled as a random walk in the word vector space by Hashimoto et al. [22]. Given that the last word produced was $w$, the probability that the next word is $w'$ is assumed to be given by $h(\lVert v_w - v_{w'} \rVert^2)$ for a suitable function h. Arora et al. [5,21] further proposed a latent discourse vector in the random walk model, which has a clear semantic interpretation and has proven useful in subsequent work, e.g., in their work on understanding the structure of word embeddings for polysemous words.
In their model, the text generation process is driven by the discourse vector $c_t \in \mathbb{R}^d$, which expresses what is being talked about. Each word has a continuous representation $w_t \in \mathbb{R}^d$, and the probability that a word is emitted at time t given the discourse vector is modelled with the log-linear word production model of Arora et al. [21]:

$$P(w_t \mid c_t) \propto \exp(\langle c_t, w_t \rangle).$$

The discourse vector $c_t$ performs a slow random walk: $c_{t+1}$ is obtained by adding a small random vector to $c_t$. Therefore, nearby words are generated under similar discourse.
Under the assumption that the discourse vector $c_t$ does not change much while the words of a sentence are emitted, the model can be simplified by replacing all the $c_t$'s in the sentence s with a single sentence discourse vector $c_s$. Arora et al. [5,21] showed that the maximum a posteriori estimate of $c_s$ is the average of the word embeddings of all the words in the sentence. So, the word emitting probability can be written as follows:

$$P(w_t \mid s) \propto \exp(\langle c_s, w_t \rangle), \qquad c_s = \frac{1}{L} \sum_{t=1}^{L} w_t,$$

where L is the number of words in the sentence.

N-Best Rescoring with Coordination Scores
In this section, we first introduce the scores used to evaluate how well a sentence is coordinated in topic or semantics. We then describe how we use the coordination scores to re-rank the n-best hypotheses.

Sentence Probability Score Using LDA
To discriminate whether a word is suitable in a document, we can use the probability of the word given the topic estimate of the document. The probability can be computed as a weighted average of the probabilities of this word given each topic, where the weights are the topic distribution of the document, i.e.,

$$P(w_i \mid s) = \sum_{j=1}^{|T|} P(w_i \mid z_j)\, P(z_j \mid s),$$

where |T| is the number of topics, a hyper-parameter of the LDA training process, and $z_j$ represents the j-th topic. For $P(w_i \mid z_j)$ we use the item in Row i Column j of the β matrix calculated by the LDA toolkit mentioned in the previous section, and for $P(z_j \mid s)$ we use the corresponding row of γ obtained for the hypothesis sentence s by the inference procedure of the toolkit. In re-ranking tasks, we regard each hypothesis sentence as the given document. This probability is known as the LSM [3], and it is used here for the first time to measure the coordination of a sentence.
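The weighted average above can be sketched in a few lines of NumPy. The sketch assumes that `beta` already holds probabilities (not log-probabilities) stored with one row per topic, and that `topic_dist` is a γ row normalized to sum to 1; both the layout and the name `word_prob_given_sentence` are assumptions of this example, not the toolkit's API.

```python
import numpy as np

def word_prob_given_sentence(beta, topic_dist, word_ids):
    """P(w_i | s) = sum_j P(w_i | z_j) P(z_j | s) for each word id.

    beta:       shape (T, V), beta[j, i] = P(w_i | z_j)  (assumed layout)
    topic_dist: shape (T,),   P(z_j | s)
    """
    return topic_dist @ beta[:, word_ids]

# Toy model with 2 topics and a 3-word vocabulary (values invented)
beta = np.array([[0.7, 0.2, 0.1],    # topic 0
                 [0.1, 0.3, 0.6]])   # topic 1
topic_dist = np.array([0.9, 0.1])    # the sentence is mostly about topic 0
probs = word_prob_given_sentence(beta, topic_dist, [0, 2])
```

In this toy case word 0, which is typical of the dominant topic, receives a much higher probability than word 2.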
The word sequence is regarded as a bag of words (BOW) in this work. Under the simplifying assumption that the words are independent, the probability of the word sequence given the topic distribution of the sentence is the product of the probabilities of each word in the sentence given that topic distribution. It can be written as:

$$P(w_1, \ldots, w_L \mid s) = \prod_{i=1}^{L} P(w_i \mid s),$$

where L is the length of the document, namely the hypothesis sentence. We use the logarithm of the word sequence probability as the sentence coordination score:

$$\mathrm{Score}_{\mathrm{LDAprob}}(s) = \sum_{i=1}^{L} \log P(w_i \mid s).$$

Topic Similarity Score Using LDA
To detect a mistakenly recognized word whose topic differs from that of the sentence s, we use the cosine similarity to measure the word-sentence topic bias:

$$\mathrm{sim}(w_i, s) = \frac{P(z \mid w_i) \cdot P(z \mid s)}{\lVert P(z \mid w_i) \rVert\, \lVert P(z \mid s) \rVert},$$

where $P(z \mid w_i)$ and $P(z \mid s)$ are both |T|-dimensional vectors, namely the topic distributions of $w_i$ and of the sentence s respectively. Provided that all the topics are equally modelled in the training process of the LDA algorithm ($P(z_j) = 1/|T|$ for all j), the topic distribution given a word can be obtained by normalizing the word probabilities of each topic:

$$P(z_j \mid w_i) = \frac{P(w_i \mid z_j)\, P(z_j)}{\sum_{k=1}^{|T|} P(w_i \mid z_k)\, P(z_k)} = \frac{P(w_i \mid z_j)}{\sum_{k=1}^{|T|} P(w_i \mid z_k)}.$$

Thus the topic probability given a word can be calculated by column normalization of the β matrix from the pre-trained LDA model.
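In general, the range of cosine similarity is [−1, 1], where a value close to 1 means that the topic of the word is very similar to that of the sentence, and a value near −1 means that the topic of the word is nearly the opposite of that of the sentence. Because topic distributions are non-negative, the values obtained here lie in [0, 1].
We calculate the sentence topic similarity score by simply adding the word similarities:

$$\mathrm{Score}_{\mathrm{LDAtopsim}}(s) = \sum_{i=1}^{L} \mathrm{sim}(w_i, s).$$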
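A minimal sketch of the topic similarity score follows. As before, it assumes `beta` stores probabilities with one row per topic, so that normalizing each column yields the topic distribution of a word; the function name and the toy values are hypothetical.

```python
import numpy as np

def topic_similarity_score(beta, topic_dist, word_ids):
    """Sum of cosine similarities between P(z | w_i) and P(z | s).

    beta:       shape (T, V), beta[j, i] = P(w_i | z_j)  (assumed layout)
    topic_dist: shape (T,),   P(z_j | s)
    """
    # Column normalization: P(z_j | w_i) = P(w_i | z_j) / sum_k P(w_i | z_k)
    topic_given_word = beta / beta.sum(axis=0, keepdims=True)
    word_vecs = topic_given_word[:, word_ids].T          # shape (L, T)
    cos = (word_vecs @ topic_dist) / (
        np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(topic_dist))
    return cos.sum(), cos

beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
topic_dist = np.array([0.9, 0.1])
total, cos = topic_similarity_score(beta, topic_dist, [0, 2])
```

Here the word that belongs to the dominant topic of the sentence scores close to 1, while the off-topic word scores much lower.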

Word-Pair Probability Score Using Word Embedding
Let $p_{i,j}$ be the probability that word j appears in the context of word i. By definition, $p_{i,j} = x_{i,j}/x_i$. According to the previous section, the logarithm of $x_{i,j}$ can be approximated by $w_i^T w_j$. Thus, $p_{i,j}$ can be calculated as follows:

$$p_{i,j} = \frac{\exp(w_i^T w_j)}{\sum_{k=1}^{V} \exp(w_i^T w_k)} = \mathrm{softmax}(W^T w_i)_j,$$

where $W = [w_1, w_2, \ldots, w_V]$ and $(\cdot)_j$ refers to the j-th element of $(\cdot)$. The calculation can be further modified by a factor γ that controls the shape of the probability distribution:

$$p_{i,j} = \mathrm{softmax}(\gamma W^T w_i)_j.$$

Here γ is the smoothing factor, which can be regarded as a stretch of all word vectors (it is unrelated to the topic distribution γ of LDA). When γ > 1, the probability distribution is sharpened, and when γ < 1, the distribution is flattened.
The word-pair probability is similar to the log bi-linear (LBL) language model [4]. It can be regarded as a bi-gram language model in which rare word pairs are smoothed by the inner product of the continuous representations. To avoid repeated computation and accelerate the calculation, we can pre-compute and store all $x_i$, i.e., the denominators $\sum_k \exp(w_i^T w_k)$. With this pre-computation, the probability of a word pair can be calculated in O(d) time, where d is the length of the word vector.
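The pre-computation described above might look as follows. This is a sketch under the assumption that the word vectors are stored as the rows of `W` (shape V × d) and that the normalizers are kept in the log domain for numerical stability; the function names are hypothetical.

```python
import numpy as np

def log_normalizers(W, gamma=1.0):
    """log sum_k exp(gamma * w_i^T w_k) for every word i, computed once."""
    logits = gamma * (W @ W.T)                           # shape (V, V)
    row_max = logits.max(axis=1, keepdims=True)
    log_z = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    return log_z.ravel()

def log_pair_prob(W, log_z, i, j, gamma=1.0):
    """log p_{i,j}: one d-dimensional inner product plus a table lookup.

    `gamma` must be the same value that was used to build `log_z`.
    """
    return gamma * (W[i] @ W[j]) - log_z[i]

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # 6 toy word vectors of dimension 4
log_z = log_normalizers(W)
```

After the one-off call to `log_normalizers`, each query costs a single inner product, which matches the O(d) cost stated above.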
Usually, the context of a word is defined as the two preceding words and the two following words of the current word. We use this setting for both word embedding training and word-pair probability prediction. We use the average of the conditional probabilities of the current word given each word in its context. The probability can be expressed as follows:

$$p_i = \frac{1}{k} \left( p_{i-2,i} + p_{i-1,i} + p_{i+1,i} + p_{i+2,i} \right),$$

where $p_{j,i}$ denotes the probability of the word at position i given the word at position j. When a context position falls outside the sentence, we use zero as the word-pair probability, and k is the number of non-zero probabilities among the four items. The word probability score for the i-th word is the logarithm of $p_i$. We also tried a window with exponential weights to emphasize the words near the current word and weaken the effect of words further from it, but the results were at a similar level to those of the rectangular window.
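The context average above can be sketched as follows. The callable `pair_prob(ctx, cur)` stands for $p_{ctx,cur}$ and is supplied here by a toy softmax table; returning negative infinity for a one-word sentence is a choice made for this sketch, since the text does not cover that case.

```python
import numpy as np

def word_pair_scores(word_ids, pair_prob, offsets=(-2, -1, 1, 2)):
    """log p_i for each position, averaged over the neighbours that exist."""
    scores = []
    for i, current in enumerate(word_ids):
        probs = [pair_prob(word_ids[i + o], current)
                 for o in offsets if 0 <= i + o < len(word_ids)]
        # len(probs) plays the role of k; out-of-sentence positions are skipped
        scores.append(np.log(np.mean(probs)) if probs else float("-inf"))
    return scores

# Toy pair-probability table P[i, j] = p_{i,j} from random vectors
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
logits = W @ W.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
scores = word_pair_scores([0, 3, 1, 4], lambda ctx, cur: P[ctx, cur])
```

Skipping the out-of-sentence positions gives the same result as adding zeros and dividing by the number of non-zero terms, because softmax probabilities are never exactly zero.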

Word-Discourse Probability Score Using Word Embedding
As explained in the Word Embedding section, the probability of a word given a sentence is proportional to the exponential of the inner product of the word embedding and the discourse vector, which can be simplified as the average of the word embeddings of all the words in the sentence. The probability of the t-th word in sentence s is calculated as follows:

$$P(w_t \mid s) = \frac{\exp(c_s^T w_t)}{\sum_{i=1}^{V} \exp(c_s^T w_i)},$$

where $w_t$ is the word embedding of the t-th word in sentence s, V is the size of the vocabulary, and $c_s$ is the discourse vector introduced in the previous section. As all words in the same sentence share the same discourse vector, we compute $c_s$ first when we begin to process the sentence. Because the normalizing sum depends on $c_s$, it cannot be pre-computed per word as in the word-pair probability score, so the computation of the word-discourse probability score is more time-consuming. Similar to the topic similarity score using LDA, the sentence word-discourse probability score is the sum of the logarithms of the word-discourse probabilities of all the words in the sentence.
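A minimal sketch of the word-discourse score is given below, assuming the word vectors are the rows of `W` and the discourse vector is the plain average of the sentence's embeddings; the function name and toy data are hypothetical.

```python
import numpy as np

def word_discourse_score(W, word_ids):
    """Sum over the sentence of log P(w_t | s) under the discourse model."""
    c_s = W[word_ids].mean(axis=0)              # discourse vector of the sentence
    logits = W @ c_s                            # <c_s, w_v> for all V words
    shift = logits.max()                        # stabilizes the exponentials
    log_z = shift + np.log(np.exp(logits - shift).sum())
    log_probs = logits[word_ids] - log_z
    return log_probs.sum(), log_probs

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 5))                     # 8 toy word vectors
word_ids = [0, 3, 5]
total, log_probs = word_discourse_score(W, word_ids)
```

The softmax denominator is recomputed over the whole vocabulary for every sentence, which is where the extra cost relative to the word-pair score comes from.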

Fallibility of Word for ASR Hypothesis
The n-best hypotheses of an ASR result have the trait that, given any two hypotheses for the same utterance, most hypothesis pairs differ in only a few words, as illustrated by the example in the introduction.
To re-rank the hypotheses of the ASR result, we tend to focus on the words that differ between the hypotheses and ignore the words that exist in every hypothesis. Motivated by this, we introduce the notion of word fallibility into the rescoring process to magnify the influence of the words that appear less often at similar places in the hypothesis list.
We first align all the hypothesis pairs for each utterance. For each word in a hypothesis, we collect in a set all the words that are aligned with it in the alignment results. (Each time the word is aligned with a blank, we regard the blank as a unique word.) The fallibility of a word in a hypothesis is defined as the number of unique words (namely the number of types) in the set that differ from the word itself. For example, the fallibility weight for the word "surged" in hypothesis 446c040q-2 (presented in the introduction) is 2, as the words "search" and "sir" both correspond to, and differ from, "surged". Figure 2 shows an example of how to compute the fallibility.

The aligning operation can be implemented by the minimum edit distance algorithm [25]. We use F(m, n) to denote the minimum edit distance between the first m words of sentence $S_1$ and the first n words of sentence $S_2$. F(m, n) is computed from F(m − 1, n), F(m, n − 1) or F(m − 1, n − 1), each of which represents a way of tagging the present words $S_1(m)$ and $S_2(n)$:

$$F(m, n) = \min \left\{ F(m-1, n) + 1,\; F(m, n-1) + 1,\; F(m-1, n-1) + \mathbb{1}[S_1(m) \neq S_2(n)] \right\}.$$

$F(|S_1|, |S_2|)$ is the final minimum edit distance of the two sentences. We can keep the path by which each item F(m, n) is computed and use it to tag each word as correct or erroneous.

Figure 3 shows an example of the minimum edit distance algorithm. In this example we align the word "monkey" and the word "money" by character. The left table is the F matrix, and its bottom right element is the final edit distance of the two words. The right matrix stores the path. For the (i, j)-th element, "0", "1" and "2" mean that F(i, j) is assigned by F(i − 1, j) + 1, F(i, j − 1) + 1 and F(i − 1, j − 1) + 0/1 respectively. From the path matrix, we can label the words in hypotheses with "correct", "substitute" and "insert" tags. For example, if the vertical words refer to the reference and the horizontal words refer to the hypothesis, "1" means that the corresponding word in the hypothesis has the "insert" tag, and "2" means that the word has the "correct" tag or the "substitute" tag, depending on whether the corresponding words are the same. We cannot give a word in the hypothesis a "delete" tag, because a deleted word does not appear in the hypothesis.
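The alignment and tagging procedure described above can be sketched as follows, using the same path codes (0 for F(i − 1, j) + 1, 1 for F(i, j − 1) + 1, 2 for the diagonal move). The function name `align_tags` and the tie-breaking rule that prefers the diagonal move are choices of this sketch, not details taken from the paper.

```python
def align_tags(ref, hyp):
    """Return (edit distance, tags for each item of `hyp`)."""
    m, n = len(ref), len(hyp)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    path = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0], path[i][0] = i, 0
    for j in range(1, n + 1):
        F[0][j], path[0][j] = j, 1
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            choices = [(F[i - 1][j - 1] + sub, 2),   # correct or substitute
                       (F[i - 1][j] + 1, 0),         # deletion from ref
                       (F[i][j - 1] + 1, 1)]         # insertion in hyp
            # min with a key keeps the first minimal choice (diagonal on ties)
            F[i][j], path[i][j] = min(choices, key=lambda c: c[0])
    # Backtrace from the bottom right corner to tag hypothesis items
    tags, i, j = [], m, n
    while i > 0 or j > 0:
        move = path[i][j]
        if move == 2:
            tags.append("correct" if ref[i - 1] == hyp[j - 1] else "substitute")
            i, j = i - 1, j - 1
        elif move == 1:
            tags.append("insert")
            j -= 1
        else:
            i -= 1                                   # deleted item: nothing to tag
    tags.reverse()
    return F[m][n], tags

distance, tags = align_tags("monkey", "money")
```

The same function works on lists of words, since it only compares items for equality.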
To emphasize the fallible words, we multiply each word score by the fallibility weight of the word when computing the sentence score.
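As an illustration, the fallibility weight and the weighted sentence score could be computed as below once the aligned words have been collected. Representing a blank by the single token "<blank>" and the helper names are assumptions of this sketch.

```python
def fallibility(word, aligned_words):
    """Number of distinct word types aligned to `word` that differ from it."""
    return len(set(aligned_words) - {word})

def weighted_sentence_score(word_scores, weights):
    """Sum of word scores, each multiplied by its fallibility weight."""
    return sum(w * s for w, s in zip(weights, word_scores))

# Words aligned with "surged" across the other hypotheses, following the
# example in the text ("search" and "sir" differ from "surged")
weight = fallibility("surged", ["search", "sir", "surged", "search"])
# Toy word scores; the first word appears in every hypothesis (weight 0)
score = weighted_sentence_score([-1.0, -2.0, -3.0], [0, weight, 1])
```

A word that is identical in every hypothesis gets weight 0 and therefore does not influence the ranking, which is the intended effect of the weighting.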

n-Best Rescoring for ASR
Hypotheses of ASR are ranked by the sum of the acoustic model score ($\mathrm{Score}_{AM}$) and the language model score ($\mathrm{Score}_{LM}$), which are the logarithms of the state probability and the word sequence probability respectively:

$$\mathrm{Score}(s) = \mathrm{Score}_{AM}(s) + \lambda\, \mathrm{Score}_{LM}(s) \qquad (18)$$

We interpolate the score of the base language model and the similarity or probability score proposed in the previous subsections with a weight α to get the new language model score:

$$\mathrm{Score}_{LM}^{new}(s) = \alpha\, \mathrm{Score}_{LM}(s) + (1 - \alpha)\, k\, \mathrm{Score}_{proposed}(s) \qquad (19)$$

where k is a normalizer, as the scale of $\mathrm{Score}_{proposed}$ might differ greatly from that of $\mathrm{Score}_{LM}$, especially for the scores with fallibility weights. We initialize k with the ratio between the medians of $\mathrm{Score}_{proposed}$ and $\mathrm{Score}_{LM}$ and then tune its value to get the best result.
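The rescoring step can be sketched as follows. The sketch assumes that the new language model score is α·Score_LM + (1 − α)·k·Score_proposed, which is one reading of the interpolation described above, and that all scores are log-domain values where larger is better. The dictionary keys, the function name and the toy numbers are invented for illustration.

```python
def rescore(hypotheses, lam=13.0, alpha=0.95, k=1.0):
    """Return the hypothesis with the highest interpolated total score.

    hypotheses: list of dicts with keys 'text', 'am', 'lm', 'proposed'
                (hypothetical structure used only in this sketch)
    """
    def total(h):
        # Assumed interpolation of the base LM score and the proposed score
        new_lm = alpha * h["lm"] + (1.0 - alpha) * k * h["proposed"]
        return h["am"] + lam * new_lm
    return max(hypotheses, key=total)

# Two toy hypotheses: "hyp a" wins on the baseline score, while "hyp b"
# has a much better semantic score
hyps = [
    {"text": "hyp a", "am": -100.0, "lm": -20.0, "proposed": -30.0},
    {"text": "hyp b", "am": -100.5, "lm": -20.0, "proposed": -12.0},
]
baseline_best = rescore(hyps, alpha=1.0)   # alpha = 1 ignores the proposed score
rescored_best = rescore(hyps)
```

With α = 1 the function reduces to the baseline ranking, so the effect of the proposed score can be checked by comparing the two calls.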

Experimental Setup
In this study, the proposed rescoring method is evaluated on the WSJ speech recognition task (https://catalog.ldc.upenn.edu/docs/LDC94S13A/wsj1.txt). The dataset contains an 80-hour training set and a text corpus of 37M tokens and 162K words for language modeling. We use Kaldi [26] for acoustic model training and decoding. The acoustic model is a time-delay neural network (TDNN) with 6 layers, each with 250 hidden units, trained with the cross-entropy (CE) loss. The first-pass decoding uses a 3-gram language model with Kneser-Ney (K-N) smoothing built with SRILM [27]. The large text corpus is used to train the n-gram and small RNN language models as well as the semantic models, i.e., LDA and GloVe. In this study, dev93 and eval92 are both used to evaluate the proposed methods.
The number of hypotheses used for re-ranking is 100 for both eval92 and dev93. We follow the Kaldi recipe and enumerate values from 9 to 20 to find the best λ in Equation (18). The best choice for eval92 is 13, and both 12 and 13 yield the best results for dev93. We use 13 for both testsets in rescoring.
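The search over λ can be illustrated with a small sketch that picks the weight giving the fewest word errors on a development set. It is not the Kaldi scoring script; the data layout, the function names and the toy scores are assumptions made for this example.

```python
def word_errors(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def best_lm_weight(utterances, weights=range(9, 21)):
    """Pick the LM weight with the fewest total word errors.

    utterances: list of (reference_words, n_best), where n_best is a list
                of (words, am_score, lm_score) tuples (hypothetical layout)
    """
    def total_errors(lam):
        errors = 0
        for ref, n_best in utterances:
            words, _, _ = max(n_best, key=lambda h: h[1] + lam * h[2])
            errors += word_errors(ref, words)
        return errors
    # min returns the first (smallest) weight among ties
    return min(weights, key=total_errors)

# One toy utterance: the correct hypothesis only wins once lambda exceeds 10.5
utterances = [(["a", "b"], [(["a", "b"], -10.5, -2.0),
                            (["a", "c"], 0.0, -3.0)])]
best = best_lm_weight(utterances)
```

Dividing the total errors by the number of reference words would give the WER; the ranking of the weights is the same either way.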

LDA-Based Rescoring
In this section, we use LDA-based scores to re-rank hypotheses. The average scores of hypotheses at different ranks are shown in Figure 4. The upper sub-figures show the average logarithms of the sentence probability and the ones at the bottom show the averages of the topic similarity. "LDAprob" refers to the LDA sentence probability score, and "LDAtopsim" refers to the topic similarity score in this figure. The tendency of the curves shows that, for all sub-figures, the hypotheses ranked at the front have higher scores than those ranked at the back. Given the assumption that hypotheses ranked at the front are generally of better quality, the curves reflect this to some extent. The average scores with fallibility are in the right two sub-figures.
We aligned all the hypotheses in both testsets with the references and tagged the words in the hypotheses as 'correct', 'substitute' or 'insert' using the aligning algorithm described above for computing fallibility. Deletion errors cannot be reported in hypotheses, so we do not have a 'delete' tag. Figure 5 shows the distribution of the two word scores for words with 'correct', 'substitute' and 'insert' tags respectively. The sub-figures on the top are those of the word probability score, and the bottom ones are those of the topic similarity score. The averaged scores for words with 'correct', 'substitute' and 'insert' tags are listed in Table 1. As shown in the table, for both score types, the average score of the 'correct' words is the highest and the average score of the 'insert' words is much lower than those of the 'correct' and 'substitute' words. This shows that the two scores can discriminate the wrong words in the hypotheses to a certain extent.

Figure 4. The horizontal axis, ranging from 1 to 100, represents the rank of hypotheses by the 1-pass decoding process. The vertical axis represents the average score of the hypotheses at the corresponding rank.

Figure 5. Distribution function of LDA based scores for words with "c", "s" and "i" tags.

Table 1. The average scores for words with "c", "s" and "i" tags for Dev93 and Eval92 testsets.

Score-Type | Dev93-c | Dev93-s | Dev93-i | Eval92-c | Eval92-s | Eval92-i

The word error rates (WER) of the new best hypotheses rescored by the LDA-based scores are listed in Table 2. "Baseline" here means the WER of the original recognition system, which corresponds to Equation (18). As mentioned before, the word-pair probability score can be regarded as a smoothed 2-gram language model, so we also list the result of a 2-gram language model trained by SRILM on the large text corpus. "smallRNN" here refers to a 2-layer RNN with 100 hidden units in each layer. In this experiment, α in Equation (19) is around 0.95, and k is 1 for LDAprob without the fallibility weight and 0.03 with it. For LDAtopsim, k is 25 and 1.5 without and with fallibility respectively.

The best results of LDAprob without the fallibility weight for the Dev93 and Eval92 testsets are 9.38% and 6.59% respectively, which are comparable to a small-scale RNN. The best results of LDAtopsim without fallibility for the Dev93 and Eval92 testsets are 9.45% and 6.70% respectively, both of which are higher (worse) than those of the LDAprob score. From the table, we notice that the number of topics, which is a hyperparameter of the topic model, has a strong effect on WER. For the LDAprob score, 20 and 30 topics are better choices for Dev93, while a topic model with only 10 topics leads to the best result for Eval92. For LDAtopsim, the best choice for both testsets is 10 topics. We have also conducted experiments on topic models with larger numbers of topics and find that, when the number of topics n is more than 30, the result gets worse as n increases. This is probably because, when n is large, the difference between topics is small. Moreover, in actual speech, if topics are similar, it is easy to move between topics. Thus, a larger topic model may reduce the scores of correct words together with those of incorrect words and decrease the difference between them.

The fallibility weight reduces WER for all the features. For the topic similarity score with the 40-topic LDA, the fallibility weight reduces WER by 0.16% and 0.21% absolute on the two testsets respectively, compared with the same feature without the fallibility weight. In addition, with fallibility, the rescoring results are less sensitive to the number of topics.

Word Embedding Based Rescoring
We use word embeddings to re-rank the hypotheses list in this section. Figure 6 is similar to Figure 4. In this figure, we show the average sentence word-pair score and word-discourse score at different ranks. The sentences ranked at the front clearly score higher than the sentences ranked behind, especially for the sentences ranked in the first 10 places. This shows that the word-pair score and word-discourse score reflect the basic trends of the sentence. Figure 7 shows the probability distributions of the scores for 'correct', 'substitute' and 'insert' tags respectively, like Figure 5. For both scores and both testsets, words with 'correct' tags tend to score higher. The average scores of words with the three tags are listed in Table 3.

Figure 6. The average score of word embedding based methods for hypotheses. The horizontal axis, ranging from 1 to 100, represents the rank of hypotheses. The vertical axis represents the average score of hypotheses at the corresponding rank.
The results of the word embedding based word-pair probability score and word-discourse probability score are listed in Table 4. In this experiment, α is around 0.95, the same as for the LDA-based scores. k is around 1 for the word-pair and word-discourse scores without the fallibility weight, and about 0.05 and 0.01 for the word-pair and word-discourse scores with the fallibility weight respectively. Firstly, we can see from the table that the features of all dimensions tested in the experiment reduce WER. As with the LDA-based features, features with larger dimensions do not result in more significant WER reductions. For the word-pair score, the best results for the Dev93 and Eval92 testsets without the fallibility weight both come from the 50-dimension GloVe feature, with 0.20% and 0.44% absolute WER reductions respectively. For the word-discourse score, the 300-dimension GloVe performs well on the two testsets as a whole, leading to 0.24% and 0.56% absolute WER reductions on the Dev93 and Eval92 testsets respectively compared with the 1-best hypotheses. The fallibility weight reduces WER for most features. Similar to its effect on the LDA-based scores, fallibility decreases the performance gap among feature dimensions for both word embedding based scores.

Figure 7. Distribution function of word embedding based scores for words with "c", "s" and "i" tags.

Table 3. The average scores for words with "c", "s" and "i" tags for Dev93 and Eval92 testsets.

Score-Type | Dev93-c | Dev93-s | Dev93-i | Eval92-c | Eval92-s | Eval92-i

In addition, we combine the LDA-based scores and the word embedding based scores to examine whether the scores can complement each other. We substitute Equation (19) with an equation that combines the two kinds of scores, where $k_1$ and $k_2$ are the parameters selected in the previous experiments for each score respectively. The related WER results are listed in Table 5. The feature dimensions used here are chosen by considering both testsets. However, combining the LDAtopsim score and the word-discourse score can reduce the WER of Eval92, compared with using only the 40-topic LDAtopsim score or the 50-dimension word-discourse score. On the whole, combining the two kinds of scores does not clearly improve the model further, which suggests that the proposed LDA-based scores and word embedding based scores contain similar information.

Conclusions
In this paper, we have presented a new way to use semantic information to rescore the n-best hypotheses without neural network language models. Two kinds of semantic features, namely LDA topic features and GloVe continuous word representations, are used and tested in this work. We have designed four sentence coordination scores from the two semantic features. Among the four scores, two are inspired by the LSM and the LBL language model respectively, and the other two are first proposed in this paper. In addition, we have designed a fallibility weight to assist the computation of the sentence semantic coordination score by instructing the model to pay more attention to the words that have more possible choices.
Experiments show that a 10- or 20-topic LDA based word probability score with the fallibility weight reduces the WER of the Dev93 and Eval92 testsets by absolute values of 0.23% and 0.46% respectively. For word embedding features, the 50-dimension word-discourse probability score with fallibility and the 300-dimension word-discourse probability score without fallibility give the largest WER reductions for the Dev93 and Eval92 testsets respectively, at 0.29% and 0.51%. The fallibility weight works well on most features for both LDA and word embedding based scores. Although it reduced WER only by a small amount, it strongly decreased the performance gap among feature dimensions for both word embedding and LDA based scores.
As future work, we plan to explore higher-order embeddings that can represent several words rather than only one. In addition, the effect of the smoothing parameter γ will be explored in future work, and the fallibility weight can be used in other rescoring experiments.
Author Contributions: C.L. contributed to the idea of this paper, designed the algorithms and wrote the original draft. C.L., T.L. and P.Z. were responsible for performing the experiments and the analysis. T.L. and P.Z. reviewed and edited the paper. Y.Y. was in charge of supervision and project administration.

Conflicts of Interest:
The authors declare no conflict of interest.