Article

Multi-Hop Question Generation Using Hierarchical Encoding-Decoding and Context Switch Mechanism

Tianbo Ji, Chenyang Lyu, Zhichao Cao and Peng Cheng
1 School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China
2 Science Foundation Ireland Centre for Research Training in Machine Learning, School of Computing, Dublin City University, Dublin 9, Ireland
3 Alibaba Group, Hangzhou 311121, China
* Author to whom correspondence should be addressed.
Entropy 2021, 23(11), 1449; https://doi.org/10.3390/e23111449
Submission received: 11 October 2021 / Revised: 28 October 2021 / Accepted: 29 October 2021 / Published: 31 October 2021
(This article belongs to the Section Multidisciplinary Applications)

Abstract

Neural auto-regressive sequence-to-sequence models have been dominant in text generation tasks, especially the question generation task. However, neural generation models suffer from global and local semantic drift problems. Hence, we propose a hierarchical encoding–decoding mechanism that aims at encoding rich structural information of the input passages and reducing the variance in the decoding phase. In the encoder, we hierarchically encode the input passages according to their structure at four granularity levels: word, chunk, sentence, and document. In the decoder, at each time-step, we progressively select the context vector from the document-level representations down to the word-level representations. We also propose a context switch mechanism that enables the decoder to reuse the context vector from the previous step when generating the current word, which improves the stability of the text generation process when a set of consecutive words is generated. Additionally, we inject syntactic parsing knowledge to enrich the word representations. Experimental results show that our proposed model substantially improves performance and outperforms previous baselines according to both automatic and human evaluation. In addition, we conduct a deep and comprehensive analysis of the generated questions based on their types.

1. Introduction

Question generation (QG) aims to generate appropriate questions for given passages. It is an important task in natural language processing (NLP) research, and it has many applications for various NLP tasks. For example, QG can be used to augment a question answering (QA) dataset that is expensive to obtain, construct a synthetic QA dataset, and facilitate a dialogue system by controlling the conversation flow through generated questions. Besides, QG can serve an educational purpose, as it can improve and enhance children’s comprehension and retention by proposing questions based on textbook passages [1,2,3,4]. In the QG research community in particular, multi-hop QG has recently become a focus because of its potential applications in understanding complex human questions created through the compositionality of questions; the goal of multi-hop QG is to generate complex questions that require evidence across multiple passages to be answered [5].
QG has attracted researchers’ interest for many years. In the early years, rule/template-based methods were the mainstream models for the QG task. For example, a rule-based approach was proposed to transform a declarative sentence into its interrogative counterparts, and a statistical ranker was then invoked to select the most appropriate questions and discard those of low quality [6]. However, rule/template-based methods can only generate trivial questions by simply reordering clauses and manipulating words in the sentence, and they cannot handle complicated sentences. Since natural language is highly flexible, there are scenarios that rule/template-based approaches fail to process; meanwhile, it is also difficult to accurately parse a sentence and obtain its constituents. To overcome such shortcomings, vector-based machine learning models have been introduced into QG tasks with the advent of the neural sequence-to-sequence (seq2seq) framework, which adds the advantages of modeling the semantics of natural language in vector space and producing more fluent and human-like text [7]. After the deployment of neural networks in QG tasks, various models were proposed and the quality of generated questions improved significantly, especially in terms of readability [8].
Despite their successful application to QG tasks, neural models still bear limitations and remain prone to generating irrelevant questions, particularly when producing complex questions from multiple relevant passages. Usually denoted as semantic drift [9], such problems in QG can be categorized into two classes: global and local. In the global semantic drift problem, a generated question might be grammatically correct, but its overall semantic meaning is irrelevant to the input passages and/or the answer. For example, given a set of passages about Isaac Newton together with the answer about the date when he was born, a neural QG model may generate “When did Isaac Newton write the book Philosophiæ Naturalis Principia Mathematica?”, or even “Who wrote the book Philosophiæ Naturalis Principia Mathematica?”. Such generated questions are fluent and meaningful, but mismatch the answer and passages. Meanwhile, local semantic drift means that the semantic units (i.e., phrases or words) in the generated question are inconsistent with each other, so they fail to form a meaningful sentence. In this case, with the previous passages and answer, the model-generated question might be “In the time of the Dark Ages, who helped Isaac Newton invent the first electronic computer?”, where “Dark Ages”, “Isaac Newton” and “electronic computer” are neither compatible nor consistent.
To address the two aforementioned semantic drift problems, we propose two separate mechanisms: a hierarchical encoding–decoding mechanism and a context switch mechanism, which, respectively, seek to alleviate the global and local semantic drift problems. Inspired by the fact that structural information at different granularity levels has proved helpful for encoding rich semantic information [10,11,12], we consider a hierarchical structure suitable for taking advantage of this structural information. Following the typical seq2seq framework, the hierarchical encoding–decoding mechanism consists of an encoder and a decoder, where the former receives the input textual passages and encodes their structural information, and the latter receives the encoded information from the encoder and decodes the question in a coarse-to-fine fashion through the computation of attention weights. In detail, four levels of granularity are involved during the encoding phase (word-level, chunk-level, sentence-level, and document-level), and the encoder encodes the textual passages based on their structure at these granularity levels. Subsequently, the decoder selects the context vector in a coarse-to-fine fashion during the decoding phase: first at the document-level, then at the sentence-level and chunk-level, and finally at the word-level. Additionally, since the decoder generates words one by one in the decoding phase, we believe the generated words in the same semantic unit (e.g., a phrase or an entity) should be more consistent and semantically related if they have similar context vectors. We therefore propose the context switch mechanism to provide similar context vectors when the QG model is expected to produce words in the same semantic unit. In its implementation, an extra layer is included to output a probability at each decoding time-step, which determines whether the context vector from the last step is reused.
We then assess the performance of the proposed model and other baseline QG models by evaluating the results on the benchmark dataset HotpotQA [5]. Prevailing automatic evaluation metrics such as BLEU [13], ROUGE [14], and METEOR [15] are employed, and we further conduct a human evaluation experiment, since these automatic evaluation methods have been shown to correlate poorly with human judgment [16]. The experimental results show the proposed model can improve the quality of generated questions according to both automatic and human evaluation.

2. Related Work and Background

2.1. Question Generation

The question generation task was explored broadly in early natural language processing work, which mainly focused on rule-based approaches using heuristics induced from linguistic knowledge (such as dependency parsing and constituency parsing) to manipulate constituents in a sentence to produce an interrogative sentence. For example, a rule-based framework that utilizes heuristics from syntactic knowledge was proposed to transform declarative sentences into corresponding candidate questions [6]. A statistical ranking model is then employed to score the candidates, and those of low quality are discarded.
Thereafter, the neural seq2seq model became dominant in the QG task and achieved high performance [17], owing to the successful application of neural models in other text generation tasks (e.g., machine translation and question answering). An attention-based bidirectional long short-term memory (LSTM) model was employed to generate questions given a passage–answer pair [8]. In order to produce questions that are relevant to the corresponding answers, Sun et al. [18] propose to incorporate the pointer-generator network [19] and the word embedding of a textual answer. Likewise, Ma et al. [20] propose a QG model that strengthens the connections among passages, answers, and questions by matching sentence-level semantics and predicting the answer position in the passage. Chen et al. [21] adopt a reinforcement learning approach to directly optimize the QG model with discrete evaluation metrics, for the purpose of bridging the gap between the training objective (word-level optimization) and the inference aim (generating a sentence-level output). With the help of advanced linguistic parsers such as dependency parsing, semantic role labeling (SRL), and named entity recognition, Dhole and Manning [22] leverage templates to generate questions based on the parsing results, including a dependency tree and SRL frames. Their approach achieves state-of-the-art results on the SQuAD dataset [23], outperforming previously proposed neural QG models and showing that QG can benefit from the incorporation of linguistic and syntactic knowledge.
There are some other works exploring different aspects of the QG task, such as incorporating question types [24], encoding wider context information [25], and combining QA and QG [26].

2.2. Multi-Hop Question Generation

The multi-hop QG task has its own complexity, since complex questions are generated from multiple interconnected input passages [5]. Gupta et al. [27] introduce reinforcement learning and multitask learning into multi-hop QG, treating answer position prediction and supporting-fact prediction as two extra tasks in the seq2seq training process. Their experimental results show that the proposed approach achieves strong performance compared to baseline models.
Pan et al. [28] use the semantic units parsed from semantic role labeling and dependency parsing to construct a semantic graph for documents, in order to model the connections among semantic units and documents that are usually neglected in prior art. A recurrent neural network (RNN) encoder and a graph neural network (GNN) encoder are then invoked to encode the documents: the representations generated by the RNN encoder capture a document’s basic textual information, while the representations from the GNN encoder are expected to contain semantic information enhanced by the graph structure induced by semantic role labeling and dependency parsing. Next, an attention-based decoder generates the question word by word. The proposed semantic graph model outperforms previous work by a large margin.
Furthermore, Xie et al. [29] explore how question-specific rewards used in reinforcement learning relate to question quality for multi-hop QG; three question-specific rewards, namely fluency, relevancy, and answerability, are proposed. From the perspective of human evaluation, the experimental findings suggest that directly optimizing relevancy yields improvements in question quality; however, optimizing the other two rewards, fluency and answerability, results in quality degradation, especially for answerability.

2.3. Evaluation of Question Generation

Previous work on question generation mostly uses discrete metrics (BLEU, ROUGE, and METEOR) borrowed from general text generation tasks (such as machine translation and text summarization). Nevertheless, those metrics have been shown to have flaws in evaluating text generation tasks. The findings of Reiter [16] support the usage of BLEU in machine translation, but suggest that BLEU is not suitable for other text generation tasks, especially when evaluating individual texts. Accordingly, evaluating question generation with such discrete metrics is problematic, since there is only one reference question for each generated question (a result of the common practice of using a QA dataset for the QG task), whereas there may be multiple appropriate questions for the same input passages and answer. Thus, metrics such as BLEU and ROUGE that evaluate lexical similarity are not well suited to the question generation task.
Human evaluation is also widely applied in the assessment of QG models. A common practice is to randomly sample a few hundred generated questions and to ask human raters to evaluate them along different dimensions (e.g., adequacy and fluency) on a five-point scale [17]. The final result of human evaluation is reported as a ranking of models by their average rating scores.

2.4. Seq2seq Generation Model and Attention-Based Decoder

In this section, we introduce the basic structure of the seq2seq text generation model [7] and the attention-based decoder [30].

RNN-Based Seq2Seq Model

Given a source sequence $X = \{x_1, x_2, \ldots, x_n\}$, a seq2seq text generation model is expected to generate a target sequence $Y = \{y_1, y_2, \ldots, y_m\}$, where $x$ and $y$ are the tokens in sequences $X$ and $Y$, respectively. A seq2seq model usually coheres with the encoder–decoder structure, where the encoder first receives the source sequence $X$ as input and produces representations of $X$, and the decoder then generates the target sequence $Y$ token by token using the previously produced representations of $X$.

A typical implementation of an RNN-based seq2seq model uses an RNN encoder and an RNN decoder to constitute its encoder–decoder structure. With the source sequence $X$ of $n$ tokens, we feed the tokens one by one into the RNN encoder:

$h_t = g(h_{t-1}, x_t)$ (1)

where $h_t$ is the encoded hidden state at time-step $t$ and $x_t$ is the $t$-th token in the sequence $X$. Following the encoding phase, the decoder takes the last encoder hidden state $h_n$ as its initial hidden state and generates each decoded hidden state one by one:

$s_t = f(s_{t-1}, y_{t-1})$ (2)

where $s_t$ is the decoder hidden state at time-step $t$ (specifically, $s_0 = h_n$) and $y_{t-1}$ is the $(t-1)$-th target token. To generate the target token at time-step $t$, we use the decoder hidden state $s_t$ to obtain a probability distribution over the vocabulary and select the token with the highest probability:

$y_t = \mathrm{softmax}(s_t, y_{t-1})$ (3)

It is noteworthy that in the training process, it is common to adopt the teacher-forcing mechanism [31], in which we directly input the ground-truth target token $y_{t-1}$ at time-step $t$ during the decoding phase, instead of the last predicted token, in order to stabilize training. In the inference process, we always input the predicted target token $y_{t-1}$, since the ground-truth target token is not available.
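To make this concrete, the following is a minimal PyTorch sketch of an RNN-based seq2seq model trained with teacher forcing; the GRU cells, layer sizes, and batch-first layout are illustrative assumptions rather than any particular system's implementation.

```python
# Minimal sketch of an RNN-based seq2seq model with teacher forcing (Eqs. (1)-(3)).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)  # h_t = g(h_{t-1}, x_t)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)  # s_t = f(s_{t-1}, y_{t-1})
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        _, h_n = self.encoder(self.embed(src))           # last encoder hidden state h_n
        # Teacher forcing: feed ground-truth tokens y_{t-1}, with s_0 = h_n
        s, _ = self.decoder(self.embed(tgt[:, :-1]), h_n)
        return self.out(s)                               # logits; softmax gives p over vocabulary
```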

2.5. Attention-Based Decoder

In the vanilla RNN encoder–decoder structure, the encoder and decoder are largely independent: the only connection between them is that the latter uses the last hidden state of the former to initialize its own hidden state, so the information in the source sequence is not fully utilized. Hence, Bahdanau et al. [30] propose an attention mechanism that enables the decoder to select the part of the source sequence on which to focus when generating a target token. Concretely, an extra term $c$, called the context vector, is added into Equation (2):

$s_t = f(s_{t-1}, y_{t-1}, c_t)$ (4)

where $c_t$ is the context vector at time-step $t$, computed as a combination of all encoder hidden states:

$c_t = \sum_{i=1}^{n} e_{i,t} h_i$ (5)

where $e_{i,t}$ is the normalized coefficient for the decoder hidden state $s_t$ and the encoder hidden state $h_i$:

$e_{i,t} = \dfrac{\phi(s_t, h_i)}{\sum_{j=1}^{n} \phi(s_t, h_j)}$ (6)

where $\phi$ is a scoring function measuring the connection between $s_t$ and $h_i$.
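A minimal sketch of this attention follows, assuming the additive scoring function of [30] (a small feed-forward network over $s_t$ and $h_i$); the layer shapes are illustrative.

```python
# Sketch of additive (Bahdanau-style) attention, Eqs. (4)-(6): phi scores each
# encoder state h_i against the query, scores are normalized into e_{i,t},
# and c_t is the weighted sum of encoder states.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, query_dim, hidden):
        super().__init__()
        self.W_s = nn.Linear(query_dim, hidden, bias=False)  # projects the query
        self.W_h = nn.Linear(hidden, hidden, bias=False)     # projects each h_i
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, s_t, H):                               # s_t: (B, q), H: (B, n, h)
        scores = self.v(torch.tanh(self.W_s(s_t).unsqueeze(1) + self.W_h(H)))  # phi(s_t, h_i)
        e = torch.softmax(scores.squeeze(-1), dim=-1)        # e_{i,t}, Eq. (6)
        c_t = torch.bmm(e.unsqueeze(1), H).squeeze(1)        # c_t = sum_i e_{i,t} h_i, Eq. (5)
        return c_t, e
```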

3. Model Architecture

Our model is a bidirectional gated recurrent unit (GRU) [32,33] based RNN consisting of an encoder and a decoder. Given a set of documents $D = \{w_1, w_2, \ldots, w_v\}$ and the answer $Ans = \{a_1, a_2, \ldots, a_u\}$, our model receives the concatenation of $D$ and $Ans$ as the input, where $w$ is a word in $D$, $a$ is a word in $Ans$, and the input contains $n$ words ($n = v + u$). Besides, we record the hierarchical information of the input using a document–sentence–chunk–word structure. In detail, $D$ can be described as a combination of documents $D = \{doc_1, doc_2, \ldots\}$, where each document is $doc = \{sent_1, sent_2, \ldots\}$, each sentence is $sent = \{chunk_1, chunk_2, \ldots\}$, and each chunk is $chunk = \{word_1, word_2, \ldots\}$.
Figure 1 provides the overall architecture of our proposed model and describes the generation process at time-step $t = 1$. The encoder first encodes the input documents to obtain their sequential representation $H_{seq}$ and injects dependency parsing into $H_{seq}$ to obtain the dependency representation $H_{dep}$; we then fuse $H_{seq}$ and $H_{dep}$ to form the word-level representation $H_{word}$. Afterwards, we successively obtain the chunk-level, sentence-level, and document-level representations ($H_{chunk}$, $H_{sent}$, and $H_{doc}$) from the word-level representation $H_{word}$; the details are introduced in Section 3.1. Hence, we have representations of the documents at four granularity levels: $H_{word} \in \mathbb{R}^{n \times h}$, $H_{chunk} \in \mathbb{R}^{num\_c \times h}$, $H_{sent} \in \mathbb{R}^{num\_s \times h}$, and $H_{doc} \in \mathbb{R}^{num\_d \times h}$, where $h$ is the dimension of each representation and $n$, $num\_c$, $num\_s$, and $num\_d$ are the numbers of words, chunks, sentences, and documents, respectively.
In the attention mechanism of the decoding phase, we select the context vector in a coarse-to-fine way at each time-step. Specifically, we first select the document-level context $c_t^{doc}$, which guides the selection of the sentence-level context $c_t^{sent}$. Then, both $c_t^{doc}$ and $c_t^{sent}$ help to select the chunk-level context $c_t^{chunk}$. Finally, we incorporate these three context vectors to select the word-level context $c_t^{word}$, and we fuse the four context vectors to obtain the context vector $c_t$ at time-step $t$, which is used to generate a word $y_t$ from the vocabulary.

3.1. Encoder

The encoder first uses a bidirectional GRU network to encode the concatenated input texts $[D, Ans]$ to obtain their sequential representation, denoted as $H_{seq} \in \mathbb{R}^{n \times h}$; we use the last hidden states of the answer tokens as the answer representation. Next, we inject the dependency parsing information into $H_{seq}$ to obtain the dependency representation $H_{dep}$. Each word in a parsing tree has an ancestor node, and some words also have child nodes, which means each word in a parsing tree has at least one edge connecting it to another word; such edge information is used to incorporate the dependency parsing information. Similar to a graph neural network (GNN), we encode this information as follows:

$\hat{w}_i = \sum_{k=1}^{K} M_k w_i^k$ (7)

$w_i = g(w_i, \hat{w}_i)$ (8)

where $w_i^k$ is the representation of the $k$-th word among the $K$ words in $w_i$'s neighborhood, and $M_k$ is the transformation matrix of the edge type connecting $w_i$ and $w_i^k$. The function $g$ then updates the current word representation $w_i$ using the aggregated representation $\hat{w}_i$. We repeat Equations (7) and (8) for $T$ turns to enable better message passing through the word representations, where $T$ is a hyper-parameter.
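A sketch of this message-passing step is given below, assuming a GRU cell for the update function $g$ and one linear map per dependency edge type for $M_k$; the edge-type inventory and the edge-list input format are assumptions for illustration.

```python
# Sketch of the dependency-injection step, Eqs. (7)-(8): each word aggregates
# its parse-tree neighbours through an edge-type-specific matrix M_k, then a
# GRU cell g updates the word representation; repeated for T turns.
import torch
import torch.nn as nn

class DependencyInjection(nn.Module):
    def __init__(self, hidden=768, num_edge_types=50, turns=3):
        super().__init__()
        self.M = nn.ModuleList(nn.Linear(hidden, hidden, bias=False)
                               for _ in range(num_edge_types))
        self.g = nn.GRUCell(hidden, hidden)   # update function g
        self.turns = turns                    # hyper-parameter T

    def forward(self, H_seq, edges):
        # edges: list of (i, j, edge_type) triples from the dependency parse
        H = H_seq.clone()                     # (n, h)
        for _ in range(self.turns):
            agg = torch.zeros_like(H)
            for i, j, k in edges:             # message from neighbour j to word i
                agg[i] = agg[i] + self.M[k](H[j])
            H = self.g(agg, H)                # w_i <- g(w_i, w_hat_i)
        return H                              # dependency representation H_dep
```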
After the injection, we obtain a set of new word representations called the dependency representation $H_{dep} \in \mathbb{R}^{n \times h}$. We then fuse $H_{seq}$ and $H_{dep}$ by concatenation to form a new sequential representation $H_{word}$. Using the word-level representation $H_{word}$, we obtain the other structural representations of the input $[D, Ans]$ according to its alignment matrices:

$H_{chunk} = \sigma(A_{chunk}^T H_{word})$ (9)

$H_{sent} = \sigma(A_{sent}^T H_{chunk})$ (10)

$H_{doc} = \sigma(A_{doc}^T H_{sent})$ (11)

where $A_{chunk} \in \mathbb{R}^{n \times num\_c}$, $A_{sent} \in \mathbb{R}^{num\_c \times num\_s}$, and $A_{doc} \in \mathbb{R}^{num\_s \times num\_d}$ are the chunk-level, sentence-level, and document-level alignment matrices, respectively. Each entry in an alignment matrix $A$ is either 1 or 0, representing whether a unit in the representation $H$ belongs to the current chunk/sentence/document or not. For example, an entry $A_{chunk}^{ij} \in A_{chunk}$ indicates whether the $i$-th word in $H_{word}$ is included in the $j$-th chunk ($A_{chunk}^{ij} = 1$) or not ($A_{chunk}^{ij} = 0$), where $i$ and $j$ are the row and the column in which the entry is located.
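Since each alignment matrix is a 0/1 membership matrix, the pooling in Equations (9)–(11) reduces to matrix products. The sketch below assumes a sigmoid for the non-linearity $\sigma$ (which the equations do not pin down) and shows, for the chunk level only, how an alignment matrix can be built from token spans.

```python
# Sketch of the hierarchical pooling of Eqs. (9)-(11) via 0/1 alignment matrices.
import torch

def build_alignment(num_rows, spans, num_cols):
    """spans[j] = (start, end) range of lower-level units covered by column j."""
    A = torch.zeros(num_rows, num_cols)
    for col, (start, end) in enumerate(spans):
        A[start:end, col] = 1.0            # unit belongs to this chunk/sentence/doc
    return A

def pool_hierarchy(H_word, A_chunk, A_sent, A_doc):
    sigma = torch.sigmoid                   # assumed non-linearity
    H_chunk = sigma(A_chunk.T @ H_word)     # (num_c, h), Eq. (9)
    H_sent  = sigma(A_sent.T @ H_chunk)     # (num_s, h), Eq. (10)
    H_doc   = sigma(A_doc.T @ H_sent)       # (num_d, h), Eq. (11)
    return H_chunk, H_sent, H_doc
```

For example, $A_{chunk}$ has one column per chunk, with ones over that chunk's token span.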

3.2. Decoder

Following a typical auto-regressive setup, our model computes the context vector $c_t$ through an attention function with the current hidden state to generate one word at a time. Specifically, we combine the last hidden states from the encoder to form the initial hidden state $s_0$. Different from the vanilla attention-based auto-regressive decoder described in Section 2.5, our decoder is equipped with a hierarchical attention function in which the context vectors are generated in a coarse-to-fine fashion (from document-level to word-level). Concretely, at a time-step $t$ during the decoding phase, the context vectors at the various levels of granularity are generated as follows:

$c_t^{doc} = \mathrm{attention}(s_{t-1}, H_{doc})$ (12)

$c_t^{sent} = \mathrm{attention}([s_{t-1}, c_t^{doc}], H_{sent})$ (13)

$c_t^{chunk} = \mathrm{attention}([s_{t-1}, c_t^{doc}, c_t^{sent}], H_{chunk})$ (14)

$c_t^{word} = \mathrm{attention}([s_{t-1}, c_t^{doc}, c_t^{sent}, c_t^{chunk}], H_{word})$ (15)

where $s_{t-1}$ is the decoder hidden state at time-step $t-1$, $[s_{t-1}, c_t^{doc}]$ denotes the concatenation of $s_{t-1}$ and $c_t^{doc}$, and the attention function follows Equations (4)–(6). Then, we use a fuse function to obtain the final context vector $c_t$ at time-step $t$ from the four computed context vectors $c_t^{doc}$, $c_t^{sent}$, $c_t^{chunk}$, and $c_t^{word}$:

$c_t = \mathrm{fuse}(c_t^{doc}, c_t^{sent}, c_t^{chunk}, c_t^{word})$ (16)

Finally, we generate the decoder hidden state $s_t$ based on the embedding of the last word $y_{t-1}$, the context vector $c_t$, and the previous hidden state $s_{t-1}$; the word $y_t$ at time-step $t$ is then generated based on $s_t$ and the previous word $y_{t-1}$:

$s_t = f(y_{t-1}, c_t, s_{t-1})$ (17)

$y_t = \mathrm{softmax}(y_{t-1}, s_t)$ (18)
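One decoding step can then be sketched as follows, assuming one attention module per granularity level (each sized for its query width, since the concatenated query grows level by level), a learned linear projection for the fuse function, and a GRU cell for $f$; these choices are assumptions for illustration.

```python
# Sketch of one coarse-to-fine decoding step, Eqs. (12)-(17).
import torch

def decode_step(s_prev, y_prev_emb, H, attn, fuse, cell):
    """H and attn are dicts keyed by 'doc', 'sent', 'chunk', 'word'."""
    c_doc, _   = attn["doc"](s_prev, H["doc"])                                      # Eq. (12)
    c_sent, _  = attn["sent"](torch.cat([s_prev, c_doc], -1), H["sent"])            # Eq. (13)
    c_chunk, _ = attn["chunk"](torch.cat([s_prev, c_doc, c_sent], -1), H["chunk"])  # Eq. (14)
    c_word, _  = attn["word"](torch.cat([s_prev, c_doc, c_sent, c_chunk], -1), H["word"])  # Eq. (15)
    c_t = fuse(torch.cat([c_doc, c_sent, c_chunk, c_word], -1))                     # Eq. (16)
    s_t = cell(torch.cat([y_prev_emb, c_t], -1), s_prev)                            # Eq. (17)
    return s_t, c_t
```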

3.3. The Context Switch Mechanism

Moreover, in order to increase the stability of the decoding process, we add the context switch mechanism, which enables a set of consecutively generated words to share similar contexts. Figure 2 shows the structure of the context switch mechanism and its working process at time-step $t$.

For the implementation of this mechanism, an extra linear layer is included to produce a probability $p_{switch}$ indicating the choice between reusing the previous context vector $c_{t-1}$ and keeping the current one $c_t$. The probability $p_{switch}$ is computed as follows:

$p_{switch} = \psi(s_t, c_t, c_{t-1})$ (19)

where the function $\psi$ uses $c_{t-1}$, $c_t$, and the current decoder hidden state $s_t$ to produce the probability. If $p_{switch} \geq \alpha$, then $c_t$ is replaced by $c_{t-1}$ in Equation (17); otherwise, $c_t$ remains, where $\alpha$ is a predefined threshold. In practice, $\alpha$ is set to 0.5, as we consider that activating the switch or not should be equiprobable.
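A sketch of this switch, assuming $\psi$ is a single linear layer followed by a sigmoid over the concatenated inputs (the paper does not specify $\psi$'s exact form):

```python
# Sketch of the context switch mechanism, Eq. (19): keep the previous context
# vector for consecutive words in one semantic unit when p_switch >= alpha.
import torch
import torch.nn as nn

class ContextSwitch(nn.Module):
    def __init__(self, hidden=768, alpha=0.5):
        super().__init__()
        self.psi = nn.Linear(3 * hidden, 1)   # psi(s_t, c_t, c_{t-1})
        self.alpha = alpha                    # predefined threshold

    def forward(self, s_t, c_t, c_prev):
        p_switch = torch.sigmoid(self.psi(torch.cat([s_t, c_t, c_prev], -1)))
        # Reuse the previous context vector when the switch is activated
        return torch.where(p_switch >= self.alpha, c_prev, c_t)
```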

3.4. Training Objective

Generally speaking, the training objective of a seq2seq model is to maximize the probability of the target sequence $Y = \{y_1, y_2, \ldots, y_m\}$ given the source sequence $X = \{w_1, w_2, \ldots, w_n\}$, as described in Equation (20):

$P(Y|X) = P(y_1, y_2, \ldots, y_m \mid w_1, w_2, \ldots, w_n) = \prod_{t=1}^{m} P(y_t \mid H, y_1, y_2, \ldots, y_{t-1})$ (20)

where $H$ is the representation of the source sequence $X$ and $y_t$ is conditioned on the tokens generated before time-step $t$. To maximize the probability $P(Y|X)$, we train our model using the negative log-likelihood loss (NLLLoss) as the generation objective:

$L(\theta) = -\frac{1}{m} \sum_{t=0}^{m-1} \log p(y_t \mid H, y_{<t}; \theta)$ (21)

where $\theta$ represents the parameters of our model. We employ the Adam optimizer [34] to optimize the parameters.
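A sketch of one training step under this objective follows, using Adam and the gradient clipping later described in Section 4.2; the PAD id and tensor layout are assumptions.

```python
# Sketch of the training step for Eq. (21): token-level negative log-likelihood.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, src, tgt, pad_id=0):
    logits = model(src, tgt)                              # (B, m-1, V): predicts tgt[:, 1:]
    loss = F.nll_loss(
        torch.log_softmax(logits, dim=-1).reshape(-1, logits.size(-1)),
        tgt[:, 1:].reshape(-1),
        ignore_index=pad_id)                              # -1/m * sum_t log p(y_t | H, y_<t)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)  # Section 4.2
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=7.5e-4)  # Adam [34]
```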

4. Experiments

4.1. Data Preparation

In this paper, we conduct experiments on HotpotQA [5], a multi-hop question answering dataset (https://hotpotqa.github.io (accessed on 15 August 2021)). The term multi-hop means that a QA model must reason over multiple passages and gather the corresponding information to answer the questions in the HotpotQA dataset. For use in the QG task, the QG model takes an answer together with its related passages to generate a question. The original HotpotQA dataset consists of ⟨passage, answer, question⟩ tuples and is split into two sets for training (90,564 samples) and testing (7405 samples). We extract the annotated supporting-fact sentences in the passages, rather than the whole passages, as the input to our model. To obtain the dependency trees and constituency trees (chunk-level information) of the documents in the training and test sets, we employ the AllenNLP [35] dependency parser (https://demo.allennlp.org/dependency-parsing (accessed on 10 September 2021)) and constituency parser (https://demo.allennlp.org/constituency-parsing (accessed on 10 September 2021)).
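For reference, the parses can be obtained programmatically with AllenNLP's pretrained predictors; the sketch below is illustrative, and the model archive URLs are the publicly released ones at the time of writing and may have moved since.

```python
# Sketch of the preprocessing step: dependency and constituency parsing with AllenNLP.
from allennlp.predictors.predictor import Predictor

dep_parser = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "biaffine-dependency-parser-ptb-2020.04.06.tar.gz")
con_parser = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "elmo-constituency-parser-2020.02.10.tar.gz")

sent = "Isaac Newton was born on 25 December 1642."
dep = dep_parser.predict(sentence=sent)   # per-token heads and dependency labels
con = con_parser.predict(sentence=sent)   # bracketed tree; chunk spans from NP/VP nodes
print(dep["predicted_heads"], dep["predicted_dependencies"])
print(con["trees"])
```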

4.2. Training and Inference Setup

The detailed hyper-parameters for training our model are selected as follows: (1) the learning rate is 7.5 × 10−4; (2) the weight decay rate is 0; (3) the batch size is 32; (4) the dropout rate is 0.4; and (5) the maximum gradient norm is 5. We employ the global vectors for word representation (GloVe) [36], where the dimension of word embedding is 300, both the encoder hidden size and decoder hidden size are set to 768. Furthermore, the number of turns T for injecting the dependency information are set to 3. During the inference phase, we input the test set into the trained hierarchical encoding–decoding model while the size of beam search is set to 5.
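Beam-search decoding at inference time can be sketched generically as follows; this is a standard log-probability beam search with beam size 5, not the authors' exact implementation.

```python
# Sketch of beam-search decoding (beam size 5) over an autoregressive decoder.
import torch

def beam_search(step_fn, s0, bos_id, eos_id, beam=5, max_len=40):
    """step_fn(token_id, state) -> (log_probs over vocab, new_state)."""
    beams = [(0.0, [bos_id], s0)]                     # (score, tokens, state)
    for _ in range(max_len):
        candidates = []
        for score, toks, state in beams:
            if toks[-1] == eos_id:                    # finished hypotheses carry over
                candidates.append((score, toks, state))
                continue
            log_probs, new_state = step_fn(toks[-1], state)
            top_lp, top_ids = torch.topk(log_probs, beam)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((score + lp, toks + [idx], new_state))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam]
        if all(t[-1] == eos_id for _, t, _ in beams):
            break
    return beams[0][1]                                # best-scoring token sequence
```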

4.3. Evaluation

4.3.1. Evaluated Models

To analyze the performance of our proposed model and the quality of the generated questions, we compare it with baseline models. Six models are involved in the comparison, described as follows:
  • Our model-1: Our proposed hierarchical encoding-decoding QG model;
  • Our model-2: The proposed QG model integrated with a larger dictionary that mitigates unknown tokens;
  • Semantic-Graph: A framework that contains semantic graphs and an encoder using an attention-based gated graph neural network [28];
  • Semantic-Graph*: Semantic-Graph with the context switch mechanism;
  • RNN: A vanilla RNN-based seq2seq model;
  • GPT-2: A large transformer-based language model [37].

4.3.2. Automatic Evaluation

We use the following prevalent evaluation metrics to automatically assess the performances of question generation models:
  • BLEU-N: A method that measures the precision based on the n-gram overlap between generated questions and references [13]. We compute BLEU-[1,2,3,4] in this experiment.
  • ROUGE-L: ROUGE-L is a method that measures precision and recall on the longest common subsequence (LCS) overlap between system outputs and references [38].
  • METEOR: METEOR uses a set of stages (e.g., word stemming, synonym matching, etc.) to generate mappings of unigrams between system outputs and references, and computes the weighted harmonic mean of precision and recall based on those mappings [15]. Recall is weighted more heavily than precision.
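These metrics can be reproduced approximately with common Python packages; the sketch below uses nltk and rouge-score as stand-ins for the official metric scripts, with simplified tokenization.

```python
# Sketch of computing BLEU-4, METEOR, and ROUGE-L on a single hypothesis.
# Requires: pip install nltk rouge-score; nltk.download("wordnet") for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

ref = "when was isaac newton born ?".split()
hyp = "when was newton born ?".split()

bleu4 = sentence_bleu([ref], hyp, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([ref], hyp)                       # recall-weighted unigram matching
rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(" ".join(ref), " ".join(hyp))

print(bleu4, meteor, rougeL["rougeL"].fmeasure)
```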
Table 1 presents the metric scores of these QG models. Compared to the baselines, our proposed QG model with a larger dictionary (Our model-2) performs best according to ROUGE-L, and the proposed model with or without the larger dictionary outperforms the previous state-of-the-art model Semantic-Graph on ROUGE-L. In particular, our model outperforms the large pre-trained language model GPT-2 on both ROUGE-L and METEOR. We also find that Semantic-Graph* has the highest METEOR and BLEU-1 scores, which demonstrates the effectiveness of our proposed context switch mechanism when applied to the Semantic-Graph model. However, GPT-2 has the best performance on BLEU-[2,3,4].

4.3.3. Human Evaluation

Since the popular automatic metrics do not appear to agree with each other, we additionally conduct a human evaluation to further investigate the performance of these QG models. We conduct a crowd-sourcing experiment on Amazon Mechanical Turk (https://www.mturk.com/ (accessed on 11 October 2021)) and ask human workers to evaluate the outputs of seven models: the previous six QG models and an extra model, Gold, whose outputs are the reference questions.
For the judgment of a single generated question, a PAQ tuple ⟨p, a, q⟩ (p = paragraph, a = answer, q = question) is shown to a human rater, who is asked to judge the quality of the question according to four aspects: fluency, relevance, answerability, and complexity. In our experiment, each human rater is assigned 15 PAQ tuples, with questions randomly selected from the outputs of the seven systems. We employ a 7-point rating scale (0–6) for every aspect, construed as: very bad, bad, fairly bad, indifferent, fairly good, good, and very good. We involved 188 human raters and collected a total of 2820 evaluated outputs; on average, each model has about 400 evaluated questions, which we believe is an appropriate sample size for human evaluation.
The results of the human evaluation are reported in Table 2, where N is the number of rated system-generated questions and the overall score is the arithmetic mean of the fluency, relevance, answerability, and complexity scores. Systems are ranked by the overall score. We observe that the Gold model has the best overall performance, as expected, while our model-2 outperforms all other models. Furthermore, the scores of our model-2 on the four separate aspects, of which fluency even reaches the level of very good (5–6), are better than those of the other five QG models, which are only deemed good (4–5).

4.3.4. Questions of Different Types

We split the questions into seven types: What, Which, Who, How, Where, When, and Other (questions without specific interrogative words), and analyze how the QG models prefer to generate questions of each type, using the rule sketched below. Table 3 shows the percentage of question types in the system-generated outputs, where Reference represents the original data set. We find that the distributions of question types generated by these QG models are mostly similar to that of the reference questions. Furthermore, our models and the RNN model are prone to generating What questions, while GPT-2 generates about 10% fewer What questions than the dataset contains. Among these models, RNN is the only model that generates no Where questions.
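The type assignment can be sketched as a simple first-match rule over interrogative words; the exact rule used for Table 3 is not specified in the paper, so this is an assumption.

```python
# Sketch of splitting questions into the seven types used in Table 3.
QUESTION_TYPES = ["what", "which", "who", "how", "where", "when"]

def question_type(question: str) -> str:
    tokens = question.lower().split()
    for qtype in QUESTION_TYPES:
        if qtype in tokens:
            return qtype.capitalize()
    return "Other"   # no specific interrogative word found

assert question_type("When was Isaac Newton born?") == "When"
```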
To take a closer look at the quality of each question type, we investigate the overall human evaluation scores on different types of generated questions. According to the results in Table 1 and Table 2, ROUGE-L is the metric that correlates best with the human scores; thus, the ROUGE-L scores of different question types are also computed. Table 4 and Table 5 show the ROUGE-L and human evaluation scores of the systems on our test data, divided by question type.
With respect to the ROUGE-L scores shown in Table 4, GPT-2 has the best quality on Other questions, while Semantic-Graph and Semantic-Graph* achieve the best quality on Where and When questions, respectively. However, our proposed model with the larger dictionary (model-2) generates What, Which, Who, and How questions with the best quality among all models. According to human evaluation, our model-2 outperforms the other question generation models on What, Who, and Other questions, especially on Other questions. It is worth noting that the vanilla RNN model achieves the highest performance on Which, How, and When questions. For Where questions, our model-2 and RNN receive no human score because no questions of this type were evaluated.

5. Discussion and Future Work

Although our model achieves superior performance over the other baseline models, there is still room for improvement: Table 5 indicates that our model performs worse than some other models on Which, How, and When questions. Hence, how to incorporate more information into contextual encoding and decoding will be a future direction to explore.
Besides, current QG models mainly focus on generating questions from textual input, while input in other formats (e.g., images, audio, and video) has received less attention. For example, visual QG is a QG problem that takes images as input, and its applications are also useful for educational purposes, including child education and interactive lectures [39]. Our further work will involve combining the proposed QG model with image understanding approaches [40], which we believe can be used to generate questions on visual arts to help children develop their ability to appreciate art.

6. Conclusions

In this paper, we propose a novel question generation model that incorporates a hierarchical encoding–decoding structure to inject the structural information of input documents, and a context switch mechanism to stabilize decoding and make the generation process more consistent. The automatic metric results in Table 1 show that our model achieves the best performance against baseline models on ROUGE-L, although it does not outperform all baseline models on the other metrics. Nonetheless, the results in Table 1 show that our proposed context switch mechanism improves performance on the automatic metrics. Furthermore, the human evaluation results show that our model outperforms all baseline models on the four criteria we used. The experimental results of both automatic and human evaluation support the effectiveness of our proposed approach on the multi-hop QG task. In addition, we conduct extensive studies analyzing the model's performance on different question types according to both automatic evaluation metrics and human evaluation scores. Future work will include incorporating our method into pre-trained language models.
The data presented in this study are available in the Supplementary Materials.

Supplementary Materials

Author Contributions

Formal analysis, T.J.; methodology, C.L.; resources, T.J.; software, C.L.; supervision, Z.C.; writing—original draft, T.J. and C.L.; writing—review and editing, Z.C.; data curation, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Science and Technology Project of Jiangsu Province, China (BK20200978), by the Humanity and Social Science Youth Foundation of the Ministry of Education of China (19YJCZH002), by the Natural University General Project of Jiangsu Province (20KJB580014), by the Nantong Basic Science Research Program (JC2020171), and by Science Foundation Ireland through the SFI Centre for Research Training in Machine Learning (18/CRT/6183).

Acknowledgments

We would like to thank the anonymous crowd-sourcing raters for their work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Das, B.; Majumder, M.; Phadikar, S.; Sekh, A.A. Automatic question generation and answer assessment: A survey. Res. Pract. Technol. Enhanc. Learn. 2021, 16, 1–15. [Google Scholar] [CrossRef]
  2. Graesser, A.C.; Chipman, P.; Haynes, B.C.; Olney, A. AutoTutor: An intelligent tutoring system with mixed-initiative dialogue. IEEE Trans. Educ. 2005, 48, 612–618. [Google Scholar] [CrossRef]
  3. Kurdi, G.; Leo, J.; Parsia, B.; Sattler, U.; Al-Emari, S. A systematic review of automatic question generation for educational purposes. Int. J. Artif. Intell. Educ. 2020, 30, 121–204. [Google Scholar] [CrossRef] [Green Version]
  4. Room, C. Question generation. Algorithms 2020, 12, 43. [Google Scholar]
  5. Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2369–2380. [Google Scholar] [CrossRef] [Green Version]
  6. Heilman, M.; Smith, N.A. Question Generation via Overgenerating Transformations and Ranking. Available online: https://apps.dtic.mil/sti/citations/ADA531042 (accessed on 1 January 2009).
  7. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014. [Google Scholar]
  8. Du, X.; Shao, J.; Cardie, C. Learning to Ask: Neural Question Generation for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017. [Google Scholar]
  9. Zhang, S.; Bansal, M. Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar]
  10. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013. [Google Scholar]
  11. Tai, K.S.; Socher, R.; Manning, C.D. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015. [Google Scholar]
  12. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016. [Google Scholar]
  13. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002. [Google Scholar]
  14. Lin, C.Y.; Hovy, E. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Edmonton, AB, Canada, 27 May–1 June 2003. [Google Scholar]
  15. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 9–10 April 2005. [Google Scholar]
  16. Reiter, E. A Structured Review of the Validity of BLEU. Comput. Linguist. 2018, 44, 393–401. [Google Scholar] [CrossRef]
  17. Pan, L.; Lei, W.; Chua, T.; Kan, M. Recent Advances in Neural Question Generation. arXiv 2019, arXiv:1905.08949. [Google Scholar]
  18. Sun, X.; Liu, J.; Lyu, Y.; He, W.; Ma, Y.; Wang, S. Answer-focused and Position-aware Neural Question Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
  19. See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017. [Google Scholar]
  20. Ma, X.; Zhu, Q.; Zhou, Y.; Li, X.; Wu, D. Improving Question Generation with Sentence-level Semantic Matching and Answer Position Inferring. arXiv 2020, arXiv:1912.00879. [Google Scholar] [CrossRef]
  21. Chen, Y.; Wu, L.; Zaki, M.J. Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation. In Proceedings of the 2019 International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  22. Dhole, K.; Manning, C.D. Syn-QG: Syntactic and Shallow Semantic Rules for Question Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online Conference, 5–10 July 2020. [Google Scholar]
  23. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016. [Google Scholar]
  24. Zhou, W.; Zhang, M.; Wu, Y. Question-type Driven Question Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar]
  25. Tuan, L.A.; Shah, D.; Barzilay, R. Capturing greater context for question generation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  26. Duan, N.; Tang, D.; Chen, P.; Zhou, M. Question Generation for Question Answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017. [Google Scholar]
  27. Gupta, D.; Chauhan, H.; Akella, R.T.; Ekbal, A.; Bhattacharyya, P. Reinforced Multi-task Approach for Multi-hop Question Generation. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 13–18 September 2020. [Google Scholar]
  28. Pan, L.; Xie, Y.; Feng, Y.; Chua, T.S.; Kan, M.Y. Semantic Graphs for Generating Deep Questions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar]
  29. Xie, Y.; Pan, L.; Wang, D.; Kan, M.Y.; Feng, Y. Exploring Question-Specific Rewards for Generating Deep Questions. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 13–18 September 2020. [Google Scholar]
  30. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  31. Williams, R.J.; Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1989, 1, 270–280. [Google Scholar] [CrossRef]
  32. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014. [Google Scholar]
  33. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the NIPS 2014 Workshop on Deep Learning, Montreal, QC, Canada, 12 December 2014. [Google Scholar]
  34. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  35. Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.F.; Peters, M.; Schmitz, M.; Zettlemoyer, L. AllenNLP: A Deep Semantic Natural Language Processing Platform. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, 15–20 July 2018. [Google Scholar]
  36. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014. [Google Scholar]
  37. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. Available online: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed on 11 October 2021).
  38. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  39. Patil, C.; Patwardhan, M. Visual Question Generation: The State of the Art. ACM Comput. Surv. 2020, 53, 1–22. [Google Scholar] [CrossRef]
  40. Castellano, G.; Vessio, G. Deep learning approaches to pattern extraction and recognition in paintings and drawings: An overview. Neural Comput. Appl. 2021, 33, 12263–12282. [Google Scholar] [CrossRef]
Figure 1. The structure of the proposed seq2seq model, including the encoder (left) and the decoder (right).
Figure 2. The structure of the context switch mechanism employed in our model.
Table 1. Results of different QG models on the HotpotQA test set; the evaluation metrics are ROUGE-L, METEOR, and BLEU-[1,2,3,4]. A score in bold indicates the model that performs best according to that metric.

| Model | ROUGE-L | METEOR | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|---|---|
| Our model-2 | 27.34 | 18.25 | 26.48 | 13.87 | 8.46 | 5.54 |
| Our model-1 | 26.92 | 17.66 | 26.60 | 13.69 | 8.34 | 5.47 |
| RNN | 26.43 | 16.56 | 25.21 | 13.06 | 8.03 | 5.37 |
| Semantic-Graph* | 26.06 | 20.71 | 27.20 | 14.41 | 9.02 | 6.05 |
| GPT-2 | 26.82 | 16.62 | 27.01 | 17.72 | 12.49 | 9.14 |
| Semantic-Graph | 25.74 | 20.32 | 26.55 | 13.94 | 8.57 | 5.56 |
Table 2. Results of the human evaluation experiment, where the overall score is the mean of fluency, relevance, answerability, and complexity, and N is the number of collected ratings. A score in bold means a model besides Gold performs best according to that evaluation aspect.

| Model | N | Overall | Fluency | Relevance | Answerability | Complexity |
|---|---|---|---|---|---|---|
| Gold | 408 | 5.05 | 5.16 | 5.00 | 5.09 | 4.97 |
| Our model-2 | 395 | 4.96 | 5.01 | 4.94 | 4.93 | 4.94 |
| Our model-1 | 417 | 4.86 | 4.91 | 4.87 | 4.80 | 4.87 |
| RNN | 382 | 4.83 | 4.79 | 4.89 | 4.80 | 4.85 |
| Semantic-Graph* | 394 | 4.74 | 4.63 | 4.80 | 4.79 | 4.76 |
| GPT-2 | 418 | 4.69 | 4.77 | 4.72 | 4.52 | 4.77 |
| Semantic-Graph | 406 | 4.62 | 4.64 | 4.68 | 4.57 | 4.58 |
Table 3. The proportion (%) of question types in the outputs from different models. Question types are ordered by their proportion in the reference set.

| Model | What | Which | Who | Other | How | Where | When |
|---|---|---|---|---|---|---|---|
| Reference | 40.8 | 23.0 | 15.9 | 10.1 | 4.1 | 4.0 | 2.2 |
| Our model-2 | 55.2 | 16.6 | 16.1 | 8.4 | 0.6 | 0.3 | 3.0 |
| Our model-1 | 49.6 | 23.5 | 13.8 | 9.4 | 1.0 | 0.9 | 2.1 |
| RNN | 49.7 | 29.7 | 11.1 | 7.8 | 0.4 | 0.0 | 1.5 |
| Semantic-Graph* | 37.7 | 21.3 | 17.0 | 14.7 | 4.1 | 4.0 | 1.4 |
| GPT-2 | 30.2 | 26.1 | 11.1 | 26.4 | 2.2 | 2.4 | 1.8 |
| Semantic-Graph | 36.2 | 20.2 | 15.4 | 18.7 | 3.2 | 2.9 | 3.4 |
Table 4. ROUGE-L scores on questions of different types. A score in bold means a model has the highest ROUGE-L score on that question type.

| Model | What | Which | Who | Other | How | Where | When |
|---|---|---|---|---|---|---|---|
| Our model-2 | 26.57 | 29.86 | 26.73 | 28.32 | 30.23 | 25.49 | 27.76 |
| Our model-1 | 26.50 | 27.15 | 25.96 | 29.29 | 29.16 | 25.86 | 29.38 |
| RNN | 25.92 | 26.98 | 24.90 | 30.06 | 20.19 | - | 26.34 |
| Semantic-Graph* | 26.05 | 25.80 | 25.40 | 25.67 | 29.93 | 26.25 | 31.16 |
| GPT-2 | 24.15 | 27.03 | 23.50 | 30.60 | 27.26 | 23.92 | 29.16 |
| Semantic-Graph | 26.11 | 26.24 | 24.74 | 24.76 | 28.79 | 27.96 | 24.07 |
Table 5. The overall human score on questions of different types. A score in bold means a model has the highest human evaluation score on that question type.

| Model | What | Which | Who | Other | How | Where | When |
|---|---|---|---|---|---|---|---|
| Our model-2 | 4.97 | 4.80 | 4.94 | 5.27 | 5.63 | - | 4.82 |
| Our model-1 | 4.92 | 4.76 | 4.85 | 4.60 | 5.67 | 5.00 | 5.15 |
| RNN | 4.81 | 4.95 | 4.68 | 4.67 | 6.00 | - | 5.50 |
| Semantic-Graph* | 4.76 | 4.62 | 4.86 | 4.88 | 4.47 | 4.84 | 3.94 |
| GPT-2 | 4.95 | 4.61 | 4.36 | 4.59 | 5.61 | 5.28 | 4.04 |
| Semantic-Graph | 4.63 | 4.79 | 4.59 | 4.50 | 4.42 | 4.72 | 4.17 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
