Neural-based automatic scoring model for Chinese-English interpretation with a multi-indicator assessment

Manual evaluation of Chinese-English interpretation can be time-consuming, unreliable and difficult to reproduce, so an automatic scoring system is needed. This paper proposes an accurate automatic scoring model for Chinese-English interpretation based on a multi-indicator assessment. For the three dimensions of the scoring rubrics (i.e. keywords, content and grammar), three improved attention-based BiLSTM neural models are proposed to learn the text of the transcribed responses. In the feature vectorisation stage, the pre-trained Bert model is used to vectorise the keywords and content, while a random initialisation is used for the grammar. In addition, fluency is assessed based on speech speed. The overall holistic score is obtained by fusing the four scores with a random forest regressor. The experimental results demonstrate that the proposed scoring method is effective and performs as well as manual scoring.


Introduction
Automatic assessment of interpretation has recently become a hot spot in the field of Computer-Assisted Language Learning (CALL). Most research on automatic scoring of spoken English, such as read-aloud tasks, focuses on prosodic features (e.g. oral fluency and intonation) (Cheng, 2018; Lin et al., 2020; Shi et al., 2020). For example, Lin et al. (2020) assessed test takers' performance in the English follow-read task using a multitask learning framework. However, effective solutions are still lacking for tasks that involve text features (e.g. keywords and grammar), such as translation or question answering. Although some studies have addressed this issue, their automatic-scoring performance on large-scale oral English tests remains limited. In the following, we identify two main challenges facing the Chinese-English interpretation task.

For keyword scoring, the scores are obtained with the help of keyword synonyms, since the key points of the test takers' responses have a great impact on the scoring results. The scoring model should therefore consider not only the keywords of each test taker's response but also their synonyms (Yoon & Lee, 2019). Neural networks have unique advantages in natural language processing, pattern recognition and other fields (Diao et al., 2022; Fang et al., 2020; Liang et al., 2022). We therefore develop a keyword-synonym corpus and use neural network models to learn the correlation between keywords and synonyms to obtain the keyword score.
For content scoring, traditional methods compute the score by calculating the semantic similarity between the test takers' responses and the reference answers. With the development of deep learning (Benkhelifa et al., 2020; Evanini et al., 2017; Li et al., 2020; Qian et al., 2018) in natural language processing, the content score can be computed directly with deep learning methods. Qian et al. (2019) used deep learning to study content scoring, and their Pearson correlation coefficients were higher than those of previous machine learning models. The Dolphin team applied a long-head attention mechanism to content scoring and achieved good accuracy. We therefore employ deep learning to assess content in the automatic scoring of Chinese-English interpretation.
In this paper, an automatic scoring model to assess Chinese-English interpretation quality is proposed. According to the Syllabus of Higher Education Self-study Examination for Business English Interpretation and Listening (hereinafter the Syllabus), the evaluation should mainly focus on the key points and sentence content, and points are deducted if grammar and pronunciation are so poor as to be unintelligible. Following the Syllabus, we select keywords, content, grammar and fluency as the scoring indicators to build the scoring model. Comprehensive experiments are conducted on real-world data to validate the effectiveness of the proposed approach.
The main contributions of this paper are as follows:
• We develop a multi-indicator assessment that combines keywords, content, grammar and oral fluency as the scoring rubrics for Chinese-English interpretation evaluation, which is more accurate and comprehensive.
• We propose an automatic scoring model based on the multi-indicator assessment. Three improved attention-based BiLSTM models are employed to evaluate the keywords, content and grammar of the response.
• We conduct extensive experiments on the dataset, and the experimental results demonstrate that our proposed scoring model outperforms the other baselines.
The remainder of this paper is organised as follows. Section 2 introduces the related work. Section 3 shows the details of the proposed scoring model for Chinese-English interpretation. Section 4 presents the data augmentation strategy adopted in this paper. Section 5 analyses the experimental results. Finally, we summarise this paper and outline the future work in Section 6.

Related work
Regarding automatic assessment of interpreting, researchers have tried to emulate what has been achieved in applied linguistics, machine translation, and natural language processing. A number of researchers have examined the relationship between utterance fluency measures and human raters' perceived fluency ratings for different modes of interpreting (Christodoulides & Lenglet, 2014;Han et al., 2020;Z.-W. Wu, 2021;Yu & van Heuven, 2017). For example, Han and Lu (2021) correlated eight utterance fluency features with fluency ratings of five expert raters and found that several temporal variables concerning speed and breakdown fluency had moderate-to-strong correlations with the fluency ratings.
With respect to automatic assessment based on linguistic features, Liu (2021) selected the criteria of information accuracy, output fluency and audience acceptability to evaluate interpreting quality and employed statistical modelling based on decision tree analysis to train the assessment model; the results indicated that the approach was capable of distinguishing students' interpretations of different qualities. Ouyang et al. (2021) studied 67 Chinese-to-English consecutive interpretation samples from the All China Interpreting Contest (ACIC) and assessed spoken-language interpreting quality purely on the basis of linguistic indices from Coh-Metrix analysis.
In terms of automatic assessment of interpreting based on machine translation metrics (e.g. BLEU), Han and Lu (2021) explored to what extent metrics such as BLEU, NIST, METEOR, TER and BERT correlate with human scores under different scoring methods. The results showed moderate-to-strong correlations between most of the machine translation metrics and the human scores, indicating the possibility of automating interpreting quality assessment with machine translation metrics.
In regard to automatic assessment of interpreting based on natural language processing, in Le et al.'s (2018) study, an interpreting corpus of 6700 utterances was built and was then fed into several word confidence estimation systems that combine 9 automatic speech recognition features for speech transcription and 24 features related to machine translation. In Stewart et al.'s (2018) study, the feature-based quality estimation model Quest++ was augmented with four additional interpreting-specific features to evaluate the interpreting quality.
Previous research shows that neural networks have rarely been used for the automatic assessment of Chinese-English interpretation. To fill this gap, this paper proposes a neural network model based on Bert, BiLSTM and the attention mechanism (Bert-BiLSTM-Attention for short) for automatic scoring in the Chinese-English interpretation task.

The automatic scoring model for Chinese-English interpretation
This section elaborates the proposed automatic scoring model. It first introduces the overall framework and process of the model, and then describes its main components.

The framework
The overall process of the proposed automatic scoring model for Chinese-English interpretation is shown in Figure 1. First, manual transcription is performed on the test taker's spoken response. Based on the speech signal and transcribed text, the features of text and pronunciation are extracted. Then, the scores of keywords, content, grammar and fluency are obtained respectively. Finally, the four scores are weighted and summed by the score fusion model to generate the holistic score and its corresponding level.

Feature extraction and scoring of keywords
According to the scoring rubrics of Chinese-English interpretation (see Table 4), it is very important to assess the keywords and their synonyms in the test takers' responses. Two aspects must be considered to evaluate the keywords appropriately: (1) the number of translated keywords in the response, and (2) the use of keywords and keyword synonyms in the response.
The Bert-BiLSTM-Attention neural network model proposed in this paper has three features: (1) the Bert model extracts rich semantic features at both the word level and the sentence level of the sequence; (2) the BiLSTM model learns contextual features; (3) the attention mechanism assigns weights to the features so that the model can focus on the most important semantic information in the sentence.
We refer to the keywords to be translated in a question as source keywords and the keywords in a reference answer as reference keywords. Different from traditional methods that directly input the responses into the neural network to obtain the keyword score (Chen et al., 2018; Liang et al., 2021), and following the approach used in cross-language tasks (Zhou et al., 2018), we concatenate the reference keywords and the source keywords to the test taker's response and then train the neural network model on the combined input, so that the semantic information of the source and reference keywords helps the model better learn the relation between the keywords and the score. Meanwhile, since human raters award points to students who use keyword synonyms, we also build a corpus of keywords and their frequently used synonyms for use in the experiments. The keywords scoring model built in this work is illustrated in Figure 2. It consists of three layers: the Bert-Embedding layer, the BiLSTM layer and the Attention layer, described in detail as follows.

(1) Bert-Embedding layer
Pre-training models represented by BERT, combined with deep learning technology, have been widely used in many research fields of natural language processing. The task of this layer is to represent each word or character with a dense vector. We use the Bert model (Devlin et al., 2018) to convert the keywords into word vectors (e.g. e_N^k, e_N^p, e_N^q, etc.). Compared with Word2Vec (Grohe, 2020), Bert is built on a bidirectional Transformer encoder, which better learns the contextual relations between words in a text. Based on the Bert model, we compare embedding dimensions of 128, 256 and 512 during training.
(2) BiLSTM layer
For the reference keywords and the source keywords, we further extract their semantic feature vectors v_p and v_q in the BiLSTM layer. v_p and v_q are obtained by concatenating the last hidden states of the forward and backward directions, e.g. concatenating [h_1^p, h_N^p] to get v_p, because the last hidden state can be regarded as a summary of all vocabulary features (Xu et al., 2019). For the test takers' responses, the input x_N^k of the BiLSTM at time step N comes from three sources: v_p, v_q and the word features of the response, i.e.

x_N^k = [e_N^k; v_p; v_q] (1)

where [ ; ] denotes vector concatenation.
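As a minimal numpy sketch of this three-source input (all dimensions here are assumed for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = 256  # BiLSTM hidden size per direction (assumed)
embed = 128   # Bert embedding size (one of the compared dimensions)

# v_p / v_q: last forward and backward hidden states of the reference /
# source keyword BiLSTMs, concatenated into one summary vector each.
v_p = np.concatenate([rng.normal(size=hidden), rng.normal(size=hidden)])
v_q = np.concatenate([rng.normal(size=hidden), rng.normal(size=hidden)])

# Input for the response BiLSTM at time step N: the word embedding e_N
# concatenated with the two keyword summaries (our reading of Equation (1)).
e_N = rng.normal(size=embed)
x_N = np.concatenate([e_N, v_p, v_q])
print(x_N.shape)  # (1152,)
```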

(3) Attention layer
This layer assigns weights to the output feature vectors of the BiLSTM layer to highlight the features that play a key role in keywords scoring. Since the holistic score is numeric, we treat the scoring task as a regression task and use the mean square error (MSE) as the loss function when training the model:

L_m = (1/m) Σ_{i=1}^{m} (y_i − ŷ_i)² (2)

where L_m is the loss value for each batch during training, m is the batch size, y_i is the score predicted by the model, and ŷ_i is the reference score of the human raters.
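A compact numpy sketch of attention-weighted pooling over BiLSTM outputs together with the MSE loss of Equation (2); the dot-product form of the attention score is an assumption, since the paper does not spell out the exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical BiLSTM outputs for a 6-step response, 2*256 dims per step.
H = rng.normal(size=(6, 512))
w = rng.normal(size=512)      # learnable attention query (assumed form)

alpha = softmax(H @ w)        # one weight per time step, sums to 1
sentence_vec = alpha @ H      # attention-weighted summary of the response

# Mean squared error over one batch (Equation (2)), invented scores.
y_pred = np.array([1.5, 0.5, 2.0, 1.0])
y_ref  = np.array([1.5, 1.0, 1.5, 1.0])
L_m = np.mean((y_pred - y_ref) ** 2)
print(L_m)  # 0.125
```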

Feature extraction and scoring of content
In the interpretation task, test takers should not only translate the keywords correctly but also express the sentence meaning clearly, so it is necessary to evaluate the sentence content. We refer to the sentence to be translated as the source sentence (Liu et al., 2020). Figure 3 illustrates the content scoring model based on Bert-BiLSTM-Attention. As Figure 3 shows, its structure is similar to that of the keywords scoring model, with two major differences: (1) the input can be not only keywords but also the word sequence of the whole sentence (prefixed with the CLS token); (2) the Bert model extracts sentence-level rather than word-level semantic features. We incorporate the features of the reference sentence v_p^s and the source sentence v_q^s into the semantic features of the test takers' responses, which are then passed to the BiLSTM layer.

Feature extraction and scoring of grammar
The word order, also known as grammar, plays a crucial role in ensuring a high-quality interpretation output. To evaluate the grammar of the test takers' responses, we use a method that combines a syntactic parse tree with a neural network.
The parse tree represents the syntactic structure of a sentence in tree form: the root node carries the label of the sentence, branch nodes carry phrase labels, and leaf nodes carry the part-of-speech (PoS) labels of the words. For example, for the source sentence "越来越多的中国人开始享受一种时尚的运动型生活方式" and the reference answer "More Chinese people than ever are enjoying a fashionable sporting life style", we generate a parse tree of the reference answer with the Stanford Parser (De et al., 2006) and then sequentially arrange its terminal nodes to form the syntactic structure of the sentence (e.g. [RBJ, JJ, NNS, . . . , NN]), as shown in Figure 4.
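A minimal numpy sketch of turning such a PoS label sequence into randomly initialised embeddings for the grammar model (the label set and embedding size below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# PoS label sequence read off the leaves of a parse tree (illustrative only).
labels = ["JJR", "JJ", "NNS", "IN", "RB", "VBP", "VBG", "DT", "JJ", "NN"]

# Random initialisation of the label embeddings: no pre-trained Bert here,
# since syntactic labels are not natural-language tokens.
vocab = sorted(set(labels))
dim = 64  # embedding size (assumed)
table = rng.normal(scale=0.1, size=(len(vocab), dim))

index = {lab: i for i, lab in enumerate(vocab)}
X = np.stack([table[index[lab]] for lab in labels])  # input to the BiLSTM
print(X.shape)  # (10, 64)
```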
With the parse tree, we build a grammar scoring model based on BiLSTM and the attention mechanism, as shown in Figure 5. It differs from the keywords scoring model in two respects: (1) the input is the sequence of syntactic labels obtained from the parse tree rather than the word sequence; (2) in the embedding layer, we use a random initialisation technique for the vectorisation stage. The upper network learns the relation between the grammar and the score.

Feature extraction and scoring of oral fluency
Oral fluency is an important indicator of a speaker's coherence and proficiency. Fluency is usually measured by speech speed, pronunciation time ratio and pause duration (Liuyan, 2015). We assess the test takers' oral fluency based on speech speed, defined as the average pronunciation duration of each word after removing the silent segments. The scoring process is shown in Figure 6.
As shown in Figure 6, the number of words in the test taker's spoken response and the pronunciation duration of each word (without the silent segments) are obtained based on short-term energy and zero-crossing rate. The speech speed is then calculated using Equation (3):

speech_speed = (1/n) Σ_{i=1}^{n} pronounce_time_i (3)

where n is the number of words in the test taker's speech, i is the index of each word, and pronounce_time_i is the pronunciation duration of word i after removing the silent segments. The faster the speech, the higher the score the test taker gets.
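Under our reading of Equation (3), the measure is the average voiced duration per word, so a smaller value corresponds to faster speech; a minimal sketch with invented durations:

```python
def speech_speed(pronounce_times):
    """Average pronunciation duration per word, silent segments already
    removed (our reading of Equation (3)); smaller means faster speech."""
    return sum(pronounce_times) / len(pronounce_times)

# Voiced duration (in seconds) of each word segment, hypothetical values.
times = [0.25, 0.50, 0.25, 0.50]
print(speech_speed(times))  # 0.375
```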

Score fusion model
After getting the scores of the keywords, content, grammar and oral fluency, we need to fuse the four scores into one final score. We perform comparative experiments with two mainstream regression models in machine learning: linear regression and random forest regression. In the linear regression model, the weights of the four feature parameters are set manually and the final score is the weighted sum of the features. The random forest regression model uses bootstrap sampling (Babar et al., 2020), generating k decision trees through multiple rounds of sampling and averaging the outputs of the k trees to obtain the final score. The experimental results indicate that the random forest regression model outperforms the linear regression model.
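A hedged sketch of the fusion step with scikit-learn's RandomForestRegressor, using synthetic indicator scores in place of the real model outputs (the data, sizes and hyperparameters below are illustrative, not the paper's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Synthetic four-indicator scores (keywords, content, grammar, fluency),
# each in [0, 1], and holistic reference scores in [0, 2]; real training
# would use the human-rated responses instead.
X = rng.uniform(size=(200, 4))
y = np.clip(1.2 * X[:, 0] + 0.6 * X[:, 1] + 0.2 * X[:, 2], 0, 2)

fusion = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = fusion.predict(rng.uniform(size=(5, 4)))

# Gini importances, analogous to the indicator weights reported in Table 8.
print(dict(zip(["keywords", "content", "grammar", "fluency"],
               fusion.feature_importances_.round(2))))
```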

Data
The data in this paper come from the Chinese-English interpretation task in Section A of Exam A and Exam B of the Guangdong Higher Education Examination Program for Self-study on 24 October 2015. There are 10 questions in total. Table 1 shows the first three questions and one reference answer for each, with keywords highlighted in bold. We collect a total of 2734 spoken responses with accurate manual scores and labels, all within 20 s in duration and recorded live by the test takers in the real oral test. Each response is scored on a discrete scale of 0-2 with an interval of 0.5 by two expert human raters, and we take the average as the reference score. The voice data must first be converted into text, so we manually transcribe the 2734 spoken responses to evaluate the scoring model accurately. At the same time, function words such as "a", "the" and "and" are removed in order to better extract the semantic features of the responses.

Data processing method
Since the average number of responses per question is only 273, it is difficult to train a neural network model with such limited data. Inspired by the work of Lun et al. (2020), we therefore employ a data augmentation strategy to enlarge the training dataset. Formally, we define a piece of combined data as: a source sentence/question (q), a test taker's answer (a), a reference answer (p) and a reference score (s). Each question has multiple reference answers. Figure 7 shows a toy example of the data augmentation strategy: we match each reference answer of a question to the test taker's response. For example, if the first question has four reference answers, one test taker's response generates four pieces of data, so the final amount of data for the first question is 270 * 4 = 1080. In Figure 7, q is the question, p_1, p_2, p_3 and p_4 are the reference answers, a is the test taker's answer, and s is the reference score. Table 2 shows the amount of data after the data augmentation strategy.
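The pairing step above can be sketched in a few lines of Python (names and values are illustrative):

```python
def augment(question, response, score, reference_answers):
    """Pair one response with every reference answer of its question,
    yielding (q, p, a, s) tuples as described above."""
    return [(question, p, response, score) for p in reference_answers]

refs = ["ref answer 1", "ref answer 2", "ref answer 3", "ref answer 4"]
rows = augment("q1", "test taker's answer", 1.5, refs)
print(len(rows))  # 4
```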

Scoring rubrics
Based on the Syllabus and the advice provided by the experts, we set four proficiency levels of A, B, C and D. The scoring rubrics of the human raters are shown in Table 3. Based on Table 3, we set our scoring rubrics and the corresponding level for each score in the paper, which is shown in Table 4. And the proportion of the responses at different levels is listed in Table 5.

Response visualisation
We visualise the spoken responses in the training set by using t-distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality-reduction technique for the visualisation of high-dimensional datasets (Van der Maaten & Hinton, 2008). Each response is first transcribed into a word sequence and then represented by a 256-dimensional vector, i.e. the average of the word embedding vectors obtained via the Bert model. Figure 8 shows the visualisation of the responses labelled with different levels.

Table 3 (excerpt). Scoring rubrics of the human raters:
- Keywords: the information unit is basically accurate and complete; 0 — the information unit is completely wrong.
- Content: 0.8 — the sentence meaning is accurate; 0.4 — the sentence meaning is basically accurate; 0 — the sentence meaning is unclear and incoherent.
- Grammar & pronunciation: normally no score is deducted when the grammar is basically correct (1. verb tenses and voices; 2. transitivity of verbs; 3. subject-predicate agreement) and the pronunciation and intonation are basically correct; scores are deducted by information unit when serious grammatical errors cause misunderstanding or incomprehension of the information (1. wrong wording; 2. incomplete sentence structure and incoherent meaning).
From Figure 8, we can see that for levels A and B the responses are highly clustered by question (Q1, Q2, . . . , Q5), whereas the level-D responses appear randomly distributed. The reason is that level-A and level-B responses are excellent and closely related to the question, while level-D responses are poor, vary greatly and are usually unrelated to the question. This observation motivates us to incorporate the questions (source sentences) into the scoring models.

Table 4. Scoring rubrics used in this paper.

Score | Level | Description
1.5 <= score <= 2.0 | A | The translation of key points is accurate, the language expression is clear and coherent, the syntactic structure is complete, the sentences express the meaning, and the interpretation proficiency is excellent.
1.0 <= score < 1.5 | B | The translation of key points is basically accurate, the language expression is basically clear and coherent, the syntactic structure is basically complete, the sentences basically express the meaning, and the interpretation proficiency is good.
0.5 <= score < 1.0 | C | The translation of key points is relatively accurate, the language expression is not clear and coherent, the syntactic structure is incomplete, the sentences expressed are inaccurate, and the interpretation proficiency is average.
0 <= score < 0.5 | D | The translation of key points is inaccurate or the content is irrelevant to the topic, the language expression is not clear or coherent, the syntactic structure is incomplete, the sentences expressed are inaccurate, and the interpretation proficiency is poor.
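The score-to-level mapping can be written as a few lines of Python; note that the lower bound of level B (1.0) is inferred from the surrounding ranges, since the source text omits it:

```python
def level(score):
    """Map a holistic score in [0, 2] to a proficiency level (Table 4).
    The 1.0 boundary for level B is inferred from the adjacent ranges."""
    if score >= 1.5:
        return "A"
    if score >= 1.0:
        return "B"
    if score >= 0.5:
        return "C"
    return "D"

print([level(s) for s in (2.0, 1.25, 0.5, 0.0)])  # ['A', 'B', 'C', 'D']
```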

Evaluation metrics
This paper uses the Pearson correlation coefficient (r, Equation (4)) and the consistency rate (c, Equation (5)) (Wu et al., 2020) to evaluate the performance of the model built in this paper.

r = Σ_i (X_i − X̄)(Y_i − Ȳ) / sqrt( Σ_i (X_i − X̄)² · Σ_i (Y_i − Ȳ)² ) (4)

c = (number of responses whose predicted level is consistent with the human-assigned level) / (total number of responses) (5)

where X̄ is the average predicted score of the model, X_i is the predicted score of each response, Ȳ is the average reference score of the human raters, and Y_i is the reference score of each response.
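Both metrics are straightforward to implement; a small self-contained sketch (sample scores and levels are invented for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient (Equation (4))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def consistency(pred_levels, human_levels):
    """Fraction of responses whose predicted level matches the
    human-assigned level (Equation (5))."""
    same = sum(p == h for p, h in zip(pred_levels, human_levels))
    return same / len(pred_levels)

r = pearson([0.5, 1.0, 1.5, 2.0], [0.6, 0.9, 1.6, 1.9])
print(round(r, 3))
print(consistency(["A", "B", "C", "D"], ["A", "B", "B", "D"]))  # 0.75
```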

Experimental results and analysis
As mentioned above, this paper develops a scoring model that assesses keywords, content and grammar with BiLSTM-Attention, evaluates fluency based on speech speed, and then uses random forest regression to generate the final score (referred to as Model 4 below). To evaluate its effectiveness, we design the following comparative experiments:
(1) Model 1: uses the Siamese-LSTM model (Liang et al., 2018), which also acquires rich semantic features, to score the keywords, content and grammar, and employs random forest regression to generate the final score.
(2) Model 2: uses the same keyword, content, grammar and fluency scoring methods proposed in this paper, but fuses the scores with a linear regression model whose weights are set manually: keywords (0.6), content (0.2), grammar (0.1) and fluency (0.1).
(3) Model 3: to verify the effectiveness of concatenating the question/source sentence to the scoring model, this comparative model concatenates only the response and the reference answer in keywords, content and grammar scoring, and employs random forest regression to generate the final score.
The experimental results are shown in Tables 6-8. Table 6 shows that the model proposed in this paper achieves the highest consistency rate and Pearson correlation coefficient, indicating its effectiveness. Model 4 also improves on Model 3, verifying that concatenating the question information into the scoring model improves performance.
Regarding the experimental environment, we use a Python development environment, install Keras, scikit-learn and other toolkits through Anaconda, and fine-tune the pre-trained weights of BERT. BERT's attention-layer dropout probability is 0.1, its activation function is GELU, the hidden-layer dropout probability is 0.1, and the hidden size is set to 256. The model has 19,389,218 parameters in total. Training runs for 80 epochs with a batch size of 8 and a learning rate of 0.001. Table 7 shows the experimental results for the 10 questions. From Table 7, both the consistency rate and the correlation between human scoring and model scoring are high, indicating that the model's scores are highly consistent with human scores. The inter-human correlation is relatively low compared with the model-human correlation because the human scores are discrete with an interval of 0.5 while the scores predicted by the model are continuous. Table 8 illustrates the importance of the scoring rubrics in the random forest regression model, calculated with Gini importance. Content and keywords have high importance, with values of 0.3997 and 0.4353, while grammar and oral fluency are relatively low, with values of 0.1593 and 0.0057. These results show that the importance (weight) of the different scoring rubrics in model scoring is consistent with the manual scoring standard: both pay more attention to keywords and content and less to grammar and oral fluency, in line with the scoring rules of the exam.
However, different from the manual scoring standard, the scoring model in this paper pays more attention to the content, which shows that the random forest regression model relies more on the content when predicting the score of a response. Since the random forest regressor assigns higher weights to features with higher accuracy, we conclude that, compared with keywords scoring, the Bert-BiLSTM-Attention model performs better in content scoring.

Conclusion
This paper proposes an automatic scoring model for Chinese-English interpretation based on neural networks. We use three attention-based BiLSTM models with different structures to learn the features of keywords, content and grammar, respectively. In the semantic feature vectorisation stage, the pre-trained Bert model is employed for the keywords and content, and random initialisation is used for the grammar. At the same time, to improve the accuracy of the model, we integrate the reference answers and the source sentences into the test takers' responses before extracting the features. For pronunciation scoring, a fluency scoring method based on speech speed is applied. The experimental results demonstrate that our proposed scoring model outperforms the baseline methods.
In the future, we will consider expanding the corpus data to further improve our model's performance. We will also consider incorporating more pronunciation-level features (e.g. rhythm, intonation, etc.) into the scoring model.