Self-aware cycle curriculum learning for multiple-choice reading comprehension

The multiple-choice reading comprehension task has recently attracted significant interest. The task provides several options for each question and requires the machine to select one of them as the correct answer. Current approaches normally follow a pre-training and then fine-tuning procedure that treats all data equally, ignoring the difficulty of the training examples. Curriculum learning (CL) addresses this issue and has been shown to improve model performance. However, previous curriculum learning methods have two problems. First, most methods are rule-based, not flexible enough, and usually suited to specific tasks, such as machine translation. Second, these methods arrange data either from easy to hard or from hard to easy, and overlook the fact that human beings usually learn from easy to difficult and then from difficult to easy when they perform reading comprehension tasks. In this article, we propose a novel Self-Aware Cycle Curriculum Learning (SACCL) approach, which evaluates data difficulty from the model's perspective and trains the model with a cycle training strategy. The experiments show that the proposed approach achieves better performance on the C3 dataset than the baseline, which verifies the effectiveness of SACCL.


INTRODUCTION
Machine reading comprehension (MRC) (Min et al., 2020; Liu et al., 2018b; Peng et al., 2020; Yan et al., 2019; Peng et al., 2021; Nishida et al., 2019) is an important challenge in the field of natural language processing (NLP) and a basic task of text-based question answering. It is generally divided into four types (Liu et al., 2019a): cloze, multiple-choice, fragment extraction, and free generation. Among them, multiple-choice MRC provides several options for each question, and the machine needs to choose one of them as the correct answer. The questions cover rich types, such as commonsense reasoning and passage summarization. In addition, the answer may not appear in the original document, so multiple-choice MRC is more challenging and requires a more in-depth understanding of the given document, question, and options.
Currently, the common practice for solving multiple-choice MRC tasks is to take pre-trained language models and fine-tune them in a simple way. During training, all training examples are processed in random order and treated equally, so the model cannot learn step by step according to the difficulty of the examples. Moreover, previous CL approaches arrange data either from easy to hard or from hard to easy. However, human beings usually learn from easy to difficult and then from difficult to easy, in an iterative way, when they perform reading comprehension tasks.
To solve these problems, we propose a novel Self-Aware Cycle Curriculum Learning (SACCL) approach, which evaluates data difficulty from the model's perspective and trains on the data with a cycle training strategy. Specifically, the self-aware approach lets the model itself judge the difficulty of the examples, which increases the flexibility and scalability of the model. The cycle training strategy (CTS) trains on the data iteratively, which allows the model to perform reading comprehension in a way that is close to how humans do.
The contributions of our article are as follows:
1. To avoid the limitations of rule-based design, we propose a self-aware approach, in which the model itself judges the difficulty of the examples.
2. To better match human thinking habits and fully train the model, we use the cycle training strategy to arrange the data.
3. We empirically show that our SACCL approach is effective and achieves better performance than the baseline on the multiple-choice Chinese machine reading comprehension (C3) dataset.

RELATED WORK

MRC datasets
Each type of MRC task has a typical dataset. CNN & Daily Mail, proposed by Hermann et al. (2015), is a cloze-style reading comprehension dataset created from news articles using heuristics and is a classic dataset in the field of MRC. RACE is a multiple-choice dataset that covers a wide range of fields. It contains more than 100,000 questions posed by experts and focuses more on reasoning skills (Lai et al., 2017). SQuAD is a span extraction dataset proposed by Rajpurkar et al. (2016), which limits the answer to continuous fragments of the original text. It promoted the development of machine reading comprehension. MS MARCO is a free answering dataset (Nguyen et al., 2016). It does not limit the answer to a fragment of the document, and it requires the machine to comprehensively understand information from multiple documents and aggregate it to generate the answer to the question, which is closer to the real world. Multiple-choice datasets provide a more accurate assessment of machine understanding of language, because questions and answers may come from human generalizations or summaries and may not appear directly in the document. Methods that rely only on information retrieval or word frequency cannot achieve good results. Sun et al. (2020) present the first free-form multiple-choice Chinese machine reading comprehension dataset (C3). This dataset contains various question types, such as linguistic, domain-specific, arithmetic, connotation, implication, scenario, cause-effect, part-whole, and precondition. Therefore, the machine needs more advanced reading skills to perform well on this task. This article conducts its experiments on the C3 dataset.

Multiple-choice methods
Previous research on multiple-choice MRC is diverse. Dai, Fu & Yang (2021) propose an MRC model incorporating multi-granularity semantic reasoning. The model fuses the global information with the local multi-granularity information and uses it to select an answer. Sun et al. (2022) extract contextualized knowledge to improve machine reading comprehension. However, these approaches add extra features to the model without analyzing the existing data in depth. Zhang et al. (2020b) introduce syntax-guided context vectors, parsing the passage and the question separately to obtain finer vector representations of both, so as to give more accurate attention signals and reduce the influence of the noise brought by long sentences. However, this requires building a specific parsing tree for passages and questions. Zhang et al. (2020a) consider the interaction between documents, questions, and options, and introduce a gated fusion mechanism to filter out useless information, but the study does not consider the difficulty of the samples themselves. In this article, we propose SACCL to mine the characteristics of the existing samples themselves, and to arrange and use them rationally to achieve better results.

Curriculum learning
Extensive research has been carried out since curriculum learning was proposed by Bengio et al. (2009). It aims to facilitate model training by presenting the data in a specific order, which leads to improved model performance (Hacohen & Weinshall, 2019; Xu et al., 2020). It has been applied to many fields, such as machine translation (Zhang et al., 2018; Platanios et al., 2019), image recognition (Büyüktas, Erdem & Erdem, 2020; Huang et al., 2020), data-to-text generation (Chang, Yeh & Demberg, 2021), reinforcement learning (Narvekar et al., 2020), information retrieval (Penha & Hauff, 2020), speech emotion recognition (Lotfian & Busso, 2019), emotion recognition (Yang et al., 2022), and spelling error correction (Gan, Xu & Zan, 2021). Curriculum learning has brought different degrees of improvement to these fields. Kumar et al. (2019) apply CL to reinforcement learning to optimize the model parameters. CL has also been shown to be useful in data processing for improving the quality of the training data (Huang & Du, 2019). Liu et al. (2018a) proposed the CL-NAG framework, which uses curriculum learning to improve data utilization and makes full use of both noisy and low-quality corpora. Although the data is usually arranged from easiest to hardest, there are also cases (Zhang et al., 2018; Kocmi & Bojar, 2017) where the data is arranged from difficult to easy with good results. With the wide application of deep learning in various fields, the use of CL to control the order of training data has received more and more attention.

METHOD
In this section, we introduce our method in three parts. First, the task description describes input data and output data. Second, we construct a multiple-choice MRC model. Third, our SACCL approach includes difficulty assessment and training strategy.

Task description
For multiple-choice reading comprehension, a document (denoted as D), a question (denoted as Q), and a set of options (denoted as O) are given. Our task is to learn the predictive function F, which generates the answer (denoted as A) from the document D, the related question Q, and the options O. We define the task as follows:

$$F(D, Q, O) \rightarrow A, \quad A \in \{O_1, O_2, O_3, O_4\} \qquad (1)$$

where the document is $D = d_1, d_2, \ldots, d_n$ (n indicates the total number of words in the document), the question is $Q = q_1, q_2, \ldots, q_m$ (m indicates the total number of words in the question), and the options are $O_u = o_1, o_2, \ldots, o_K$, where $u = (1, 2, 3, 4)$ and each option consists of K words. A is one of the four options, as shown in Eq. (1). This setup requires understanding and reasoning about both the question and the document in order to infer the answer.

Model
Following the implementation of BERT (Devlin et al., 2019), we concatenate the document, the question, and the option into one input sequence, as shown in Eq. (2). The sequence uses two special tokens: <CLS> marks the beginning, and <SEP> serves as both the middle separator and the end marker. The length of the input sequence is the sum of the lengths of the document, the question, the option, and the special tokens. We pass the sequence to BERT and apply a linear classifier on top of it to obtain the probability distribution, as shown in Eqs. (3) and (4), where the logits give the probability distribution over the options.
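The input construction and answer selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the exact placement of the middle <SEP> tokens, the helper names `build_input`, `softmax`, and `select_answer`, and the use of a softmax over one logit per option are assumptions made for illustration, and the BERT encoder itself is left out.

```python
import math
from typing import List, Sequence

CLS, SEP = "<CLS>", "<SEP>"  # special tokens as written in the text


def build_input(document: Sequence[str], question: Sequence[str],
                option: Sequence[str]) -> List[str]:
    """Concatenate document, question, and one option into a single sequence.

    Assumption: one <SEP> after each of the three parts.
    """
    return [CLS, *document, SEP, *question, SEP, *option, SEP]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the per-option logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_answer(option_logits: Sequence[float]) -> int:
    """Return the index of the option with the highest probability."""
    probs = softmax(option_logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

In this sketch each of the four options yields its own input sequence, and the classifier is assumed to produce one logit per sequence; `select_answer` then picks the most probable option.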

Our SACCL approach
Our SACCL (Self-Aware Cycle Curriculum Learning) approach consists of two important parts. Firstly, Difficulty Assessment judges the difficulty score of each sample. Secondly, Training Strategy arranges the order in which the samples are trained.

Difficulty assessment
As noted in the Introduction, to avoid the limitations of rule-based design and to increase the scalability of the model, we design a self-aware approach in which the model itself judges the difficulty score of each sample, and the samples are then sorted according to their difficulty scores. The whole process consists of two parts.
Firstly, the original training data is divided into six blocks, and a separate model is trained on each block. Secondly, the trained models are used to score and sort the samples of the other blocks, as shown in Fig. 1. Let R be the set of training examples and $r_k$ the k-th example in R. Difficulty assessment calculates the difficulty score $s_k$ of $r_k$ according to a certain evaluation standard. Let S be the whole set of difficulty scores corresponding to R.
To evaluate the difficulty, we first randomly divide the training set R into six blocks according to the size of the dataset, denoted as $\{R'_i : i = 1, 2, \ldots, 6\}$. Then we train six corresponding models $\{M'_i : i = 1, 2, \ldots, 6\}$ on them: each model $M'_i$ uses $R'_i$ as its training set, and the remaining five blocks are used as the validation set to obtain the difficulty scores. We also tried splitting the data into five or seven blocks, but the experiments worked best with six blocks, so we split the data into six blocks.
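The block split above, together with the scoring rule described next in the text (one point for every model that answers an example correctly, counted only over models that were not trained on that example's block), can be sketched as follows. This is an illustrative sketch rather than the authors' code: `split_into_blocks`, `cross_block_scores`, and the `train_fn` callback are hypothetical names, and `train_fn` stands in for fine-tuning a multiple-choice model on one block.

```python
import random
from typing import Callable, List, Sequence


def split_into_blocks(n_examples: int, n_blocks: int = 6,
                      seed: int = 0) -> List[List[int]]:
    """Randomly partition example indices into n_blocks near-equal blocks."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_blocks] for i in range(n_blocks)]


def cross_block_scores(
    examples: Sequence,
    labels: Sequence[int],
    blocks: List[List[int]],
    train_fn: Callable[[list, list], Callable[[object], int]],
) -> List[int]:
    """Score each example by how many out-of-block models predict it correctly.

    train_fn(train_examples, train_labels) must return a predict function
    (hypothetical interface). With six blocks, scores range from 0 to 5.
    """
    scores = [0] * len(examples)
    for block in blocks:
        predict = train_fn([examples[k] for k in block],
                           [labels[k] for k in block])
        in_block = set(block)
        for k in range(len(examples)):
            if k in in_block:
                continue  # a model never scores its own training block
            scores[k] += int(predict(examples[k]) == labels[k])
    return scores
```

A higher score means more models answered the example correctly, so sorting by descending score puts the easiest examples first.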
Specifically, after training, $M'_i$ predicts the examples in $R'_j$, where $j \neq i$. F is the metric calculation formula, which returns 1 for a correct prediction and 0 for an incorrect one:

$$F(M'_i, r_k) = \begin{cases} 1, & \text{if } M'_i \text{ predicts } r_k \text{ correctly} \\ 0, & \text{otherwise} \end{cases}$$

The scores of each example are summed after all predictions are finished:

$$s_k = \sum_{i \neq j} F(M'_i, r_k), \quad r_k \in R'_j$$

The examples are sorted according to $s_k$. A score of $s_k = 5$ indicates that all five models predict correctly; this type of example is the simplest and is ranked first, and so on. A score of $s_k = 0$ indicates that all five models predict incorrectly; this type of example is the hardest and is ranked last.

Training strategy
To train on the data iteratively, from easy to hard and then from hard to easy, which better matches human thinking habits, we arrange the data in a circular manner, which we call the cycle training strategy (CTS). Our method arranges the training examples R according to their difficulty scores S obtained in the previous section. After all samples are sorted according to $s_k$, they are divided into M buckets: $\{B_t : t = 1, 2, \ldots, M\}$. In this article, we take M as 6. $B_1$ is the easiest, and $B_6$ is the hardest.
We design our training strategy in a multi-stage setting $\{T_u : u = 1, 2, \ldots, N\}$, where N indicates the number of epochs. In the first stage, only $B_1$ is used for training. From the second stage on, one more bucket $B_t$ is added each time until all the data are in the training set. After an epoch of training, the bucket most recently added to the training set is removed. One bucket is removed in each round until only $B_1$ is left. The data is shuffled for training at each stage. The training strategy is given in Algorithm 1.
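The staged schedule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function names are hypothetical, buckets are assumed to be formed by grouping examples with the same difficulty score (the text does not say how the sorted samples are cut into buckets), each stage is assumed to correspond to one pass over its data, and only a single easy-to-hard-to-easy cycle is generated.

```python
import random
from typing import Iterator, List, Sequence


def buckets_by_score(scores: Sequence[int], max_score: int = 5) -> List[List[int]]:
    """Group example indices into buckets, easiest (highest score) first.

    Assumption: one bucket per score value, so max_score = 5 gives 6 buckets.
    """
    buckets: List[List[int]] = [[] for _ in range(max_score + 1)]
    for k, s in enumerate(scores):
        buckets[max_score - s].append(k)
    return buckets


def cycle_bucket_counts(num_buckets: int) -> List[int]:
    """Number of buckets used at each stage: 1, 2, ..., M, M-1, ..., 1."""
    up = list(range(1, num_buckets + 1))
    return up + up[-2::-1]


def cts_stages(buckets: List[List[int]], seed: int = 0) -> Iterator[List[int]]:
    """Yield the shuffled training indices for each stage of one cycle."""
    rng = random.Random(seed)
    for count in cycle_bucket_counts(len(buckets)):
        stage = [k for bucket in buckets[:count] for k in bucket]
        rng.shuffle(stage)  # data is shuffled at each stage
        yield stage
```

A training loop would iterate over `cts_stages(...)` and run one training pass per yielded index list; dropping the descending half of `cycle_bucket_counts` gives a plain easy-to-hard curriculum for comparison.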

EXPERIMENTS

Datasets
To verify the effectiveness of SACCL on multiple-choice tasks, we use the multiple-choice Chinese machine reading comprehension dataset (C3). This dataset contains 13,369 documents collected from questions in the general domain of the Chinese Proficiency Test and 19,577 multiple-choice free-form questions associated with these documents. There are 11,869 questions in the training set, 3,816 questions in the validation set, and 3,892 questions in the test set. The documents include dialogues and non-dialogue documents with mixed topics (e.g., stories, news reports, monologues, or advertisements), which requires models to have stronger reasoning capabilities. Based on these two types of documents, C3 tasks can be classified into C3-Dialogue ($C^3_D$) and C3-Mixed ($C^3_M$). In addition, 86.8% of the questions in this dataset require combining knowledge within the document with knowledge external to it (general world knowledge) to better understand the given text.

Evaluation metric
Exact Match (EM), Precision, Recall, and macro-F1 (Rajpurkar et al., 2016) are used to evaluate model performance. Exact Match measures the proportion of results (including positive and negative cases) that the model predicts correctly. A higher EM value means that the model answers more questions correctly. Precision is the proportion of correctly predicted samples among all samples predicted as positive, and Recall is the proportion of all positive samples that are correctly predicted. F1 measures the overlap between the prediction and the ground-truth answer.
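As a rough illustration of these metrics, the sketch below treats the index of the chosen option as a class label, so that EM reduces to accuracy and Precision, Recall, and F1 are macro-averaged over the classes. This reading is an assumption, since the text does not spell out how the metrics are computed for multiple-choice answers, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def exact_match(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Proportion of questions whose predicted option equals the gold option."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_prf(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 over the observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro averaging gives each option position equal weight regardless of how often it is the gold answer.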

Experimental settings
Our model is built using the PyTorch deep learning framework. We further pre-train with this model on one 3090 GPU. The hyperparameters of our model are set as shown in Table 2. To prevent overfitting and excessive training time, the model is evaluated on the validation set after every round during the training phase.

Experimental results
To evaluate the SACCL approach, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ALBERT (Lan et al., 2020) are used as the baselines. BERT is a bidirectional transformer (Vaswani et al., 2017) network. It uses the Encoder module of the transformer architecture and abandons the Decoder module, so it has bidirectional encoding capabilities and powerful feature extraction capabilities. BERT introduces masked language modeling (MLM) and next sentence prediction (NSP) tasks to train on plain text. RoBERTa is a more robustly optimized version of the BERT model. It removes the NSP task, uses a dynamic masking strategy, and trains with a larger batch size and learning rate. ALBERT is A Lite BERT. It introduces parameter reduction techniques, which significantly reduce the number of parameters of BERT without significantly compromising its performance, thereby improving parameter efficiency. The base pre-trained language models are used in the main experiments. We reimplement the article with BERT-base and obtain better results on the test set.

Main comparison
To illustrate the effectiveness of our method, we first arrange the training data by self-aware curriculum learning (SCL): the data is arranged from easy ($B_1$) to hard ($B_6$), one bucket is added each time until all the data are in the training set, and the model is then trained for several epochs until convergence. From Table 3, we can see that the SCL approach achieves a 0.99% gain in EM over the results in the article on the C3 test set. It achieves a 0.28% gain in EM over the baseline RoBERTa model and a 1.92% gain in EM over the baseline ALBERT model on the C3 test set. This indicates that giving the model the easiest data at the beginning of training and then gradually increasing the difficulty yields better results than arranging the data randomly. To verify the effect of the CTS (Cycle Training Strategy) approach, we arrange the data in an iterative manner, as illustrated in Fig. 2. As shown in Table 3, the CTS approach achieves a 1.1% gain in EM over the results in the article on the C3 test set. It achieves a 0.56% gain in EM over the baseline RoBERTa model and a 1.92% gain in EM over the baseline ALBERT model on the C3 test set. This indicates that repeatedly arranging the data from easy to difficult and then from difficult to easy yields better results than the SCL approach. In summary, the performance on the benchmark dataset validates the effectiveness of the proposed CTS.

Table 2 (partial). Warm-up proportion: 0.1. Optimization function: Adam.
Comparison of self-aware approach and others
To better understand the SACCL approach, we compare the self-aware approach with two other ways of assessing difficulty. First, the empirical-based approach evaluates sample difficulty based on human observation and takes into account the types of questions. Second, the document-length-based approach evaluates sample difficulty based on document length. For the empirical-based approach, we divide the training dataset into six buckets based on the types of questions. See Table A.1 for the types of questions. The results are shown in Table 4. Neither the CL nor the CTS experiments achieve better results than the self-aware approach. On the development set, the EM value of the self-aware method reaches 65.85, while the EM value of the empirical-based method is only 63.81. This indicates that the empirical-based approach does not work for all tasks. People's knowledge differs: for the same question, people with the relevant knowledge find it easy, and people without it find it difficult, so the resulting order may not really run from easy to difficult. We also counted the data distribution across the buckets and found that the dataset divided by the self-aware approach is more balanced than the dataset divided by empirical analysis, as shown in Table 5. For the second approach, we divide the training dataset into six buckets based on document length. Bucket1 has the shortest data and bucket6 has the longest data. The results are shown in Table 6. The CL method is lower than the baseline on both the validation set and the test set, but the CTS method is higher than the baseline on both. The data distribution of this method is shown in Table 5. In both methods, the results of CTS are higher than those of CL, which supports the effectiveness of the CTS approach.

Document length analysis
The maximum text length of the BERT model is 512. According to the analysis in Table 7, 14.2% of the documents in the C3 training dataset are longer than 512. When the sum of the document length, question length, and option length exceeds 512 in a multiple-choice task, we select the document for additional processing. This is because the document is generally the longest part, and in the cases where the question is longer than the document, the total length does not exceed 512. Following Sun et al. (2019), three strategies are considered for truncating the document. The first deletes from the tail. The second deletes from the head. The third deletes from the middle (we keep the first 128 and the last 380 tokens). Experimental analysis shows that most of the key information of long texts in the C3 dataset comes from the head of the document, as shown in Table 8. Therefore, the experiments in this article are based on this setting.
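The three truncation strategies can be sketched as a small helper. The function name, the strategy labels, and the `max_doc_len` budget are illustrative assumptions; in practice the budget would be whatever remains of the 512 positions after the question, the option, and the special tokens are accounted for.

```python
from typing import List, Sequence


def truncate_document(doc_tokens: Sequence[str], max_doc_len: int,
                      strategy: str = "delete_tail",
                      head_keep: int = 128) -> List[str]:
    """Shorten a document to at most max_doc_len tokens.

    delete_tail   keeps the head of the document,
    delete_head   keeps the tail of the document,
    delete_middle keeps the first head_keep tokens plus the final tokens.
    """
    doc = list(doc_tokens)
    if len(doc) <= max_doc_len:
        return doc  # nothing to truncate
    if strategy == "delete_tail":
        return doc[:max_doc_len]
    if strategy == "delete_head":
        return doc[-max_doc_len:]
    if strategy == "delete_middle":
        if head_keep >= max_doc_len:
            raise ValueError("head_keep must be smaller than max_doc_len")
        tail_keep = max_doc_len - head_keep
        return doc[:head_keep] + doc[-tail_keep:]
    raise ValueError(f"unknown strategy: {strategy}")
```

With `max_doc_len=508` and `head_keep=128`, the `delete_middle` strategy keeps the first 128 and the last 380 tokens, matching the split quoted in the text.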

CONCLUSION
In this article, we present a Self-Aware Cycle Curriculum Learning (SACCL) approach for multiple-choice reading comprehension, in which the model itself judges the difficulty of the samples and learns in a loop like a human, that is, from easy to hard and from hard to easy. The proposed SACCL outperforms the baseline models. We also experiment with other rule-based approaches, and the comparison further supports the effectiveness of our method. For future work, we consider transferring this method to other tasks, such as machine translation and text classification, to validate the robustness of our SACCL approach.

A FURTHER EXPLANATION OF EMPIRICAL-BASED METHOD
The empirical-based approach is entirely manual. In Table A.1, the question type is 'matching', and the answer can be obtained from the sentence 'The gas price will be changed again tomorrow' in the text. This type of question is usually simple. In Table A.2, the question type is 'linguistic'. This type of question usually requires understanding the whole document, which is more difficult than the 'matching' type. In Table A.3, the question type is 'domain specific'. The question is: What season is it now? From external knowledge, we know that fresh 'Red Fuji' apples mature around September in China. The correct answer is spring. Therefore, we argue that this type of question is usually more difficult than the 'linguistic' type. Bold in the options indicates the correct answer.