Combined Approach for Answer Identification with Small Sized Reading Comprehension Datasets

ABSTRACT


INTRODUCTION
A machine reading comprehension framework must understand a given paragraph and answer questions based on it. In this work, the input consists of a paragraph P, questions based on the paragraph (qi), four answer options (ai) per question, and one correct answer option. Given this input, the system identifies the correct answer option out of the four. This work emphasizes the role of pretrained language tools, such as a POS tagger and a dependency parser, and of pretrained language models such as Sentence-BERT in identifying answers from paragraph text.
A machine needs to understand natural language text in order to extract the relevant part of a passage and provide an answer. For AI-based systems, extracting relevant information from textbook data has been a challenging task. In machine reading comprehension (MRC) systems, the information can be represented as {{P}, {Q}, {A}}, where P={s1, s2, s3, ..., sn} is the set of sentences (si) of the paragraph P; Q={q1, q2, q3, ..., qn} is the set of questions based on the paragraph; and A={a1, a2, a3, a4} is the set of answer options ai.
Hypothesis: Given a paragraph, a question qi, and answer options ai, the system identifies the correct answer option if the combination q+ai matches the relevant sentence or sentences of the paragraph. The set P={S1, S2, …, Sn} represents the paragraph as a set of sentences. The set Q={q1, q2, ...} contains the questions on the paragraph. The set A={a1, a2, a3, a4} contains the four possible answer options, exactly one of which is correct. The answer option sentence (q+ai) is obtained by combining the question with an answer option.
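The {P, Q, A} representation can be sketched as a small data structure. This is only an illustration: the class name `MRCInstance`, its field names, and the two paragraph sentences are assumptions introduced here, not part of the original system.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class MRCInstance:
    """One question over a paragraph (hypothetical container, not from the paper)."""
    paragraph: List[str]   # P = {s1, ..., sn}, one string per sentence
    question: str          # qi
    options: List[str]     # A = {a1, a2, a3, a4}
    correct_index: int     # position of the correct option in `options`

    def __post_init__(self) -> None:
        if len(self.options) != 4:
            raise ValueError("expected exactly four answer options")
        if not 0 <= self.correct_index < len(self.options):
            raise ValueError("correct_index must point at one of the options")

    def candidate_pairs(self) -> Iterator[Tuple[int, int]]:
        """Yield every (option index, sentence index) pair that must be scored."""
        for a in range(len(self.options)):
            for s in range(len(self.paragraph)):
                yield a, s


# Illustrative instance; the paragraph sentences are made up for the example.
example = MRCInstance(
    paragraph=["Calcium makes strong bones.", "Iron is needed for blood."],
    question="Which mineral makes strong bones?",
    options=["Calcium", "Iron", "Sodium", "Zinc"],
    correct_index=0,
)
```

Each answer option is later turned into an answer option sentence and compared against every paragraph sentence, which is why the pairs are enumerated explicitly.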
Lexical similarity indicates matching of extractive features. Nouns, verbs, prepositions, and certain combinations such as noun phrase chunks, verb-preposition (v-prep) pairs, and noun-preposition (n-prep) pairs are considered extractive features.
Noun phrase chunks and verb phrase chunks are meaningful groupings of words on the basis of syntax. Such groupings indicate lexical semantic features. These features are identified in both the answer option sentence and the paragraph sentence, using the POS tagger and the dependency graph obtained from Stanford's annotation pipeline.
Semantic similarity indicates the equivalence of sentences, based mainly on noun phrases, verb phrases, and prepositions. It is measured using sentence embeddings and cosine similarity.
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model that is pretrained on large corpora of text. Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings [1].
The SBERT transformer is used to measure semantic textual similarity between pairs of sentences, namely answer option sentences and paragraph sentences.
SBERT generates semantically meaningful sentence embeddings, which are compared using cosine similarity. This score is used to identify the paragraph sentence that is equivalent to an answer option sentence.
Various lexical and semantic similarity features are considered in combination with SBERT sentence embeddings.
The proposed approach combines a knowledge base with a transformer model for sentence embedding. The knowledge base is constructed from the dependency graph returned by the Stanford CoreNLP parser. Every sentence is represented as a sequence of tokens. In the dependency graph, every token is connected to other tokens through grammatical relationships. From these grammatical dependency relationships, specific combinations of patterns can be generated.
A knowledge graph is another representation. It is an abstract graph that consists of nodes corresponding to entities and edges corresponding to specific grammatical relationships. In this representation, nouns and noun phrases are the entity nodes, which are connected by edges corresponding to grammatical relationships such as verb, preposition, and verb-preposition edges. The questions of interest are whether a verb edge exists between two noun or noun phrase chunks, and whether a verb-preposition edge exists. The knowledge graph is generated from the dependency relations of nouns, verbs, and prepositions provided by the dependency graph, and is stored as a knowledge base.
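A minimal sketch of such a knowledge base follows, assuming the edges have already been read off the dependency graph. The class `KnowledgeBase`, its method names, and the edge kind strings are hypothetical; the single edge is hand-written for the example sentence "The black salt is obtained from rocks."

```python
from collections import defaultdict


class KnowledgeBase:
    """Stores typed edges between noun / noun phrase nodes (illustrative only)."""

    def __init__(self):
        # (head NP, tail NP) -> set of (edge kind, edge label)
        self._edges = defaultdict(set)

    def add(self, head, label, kind, tail):
        """Record an edge such as black salt --[v-prep: obtained from]--> rocks."""
        self._edges[(head.lower(), tail.lower())].add((kind, label.lower()))

    def has_edge(self, head, tail, kind):
        """Is there an edge of the given kind between the two noun phrases?"""
        found = self._edges.get((head.lower(), tail.lower()), ())
        return any(k == kind for k, _ in found)

    def labels(self, head, tail, kind):
        """Return the sorted labels of all edges of one kind between two nodes."""
        found = self._edges.get((head.lower(), tail.lower()), ())
        return sorted(lbl for k, lbl in found if k == kind)


kb = KnowledgeBase()
# Hand-written edge; in the described system it would come from the parser output.
kb.add("black salt", "obtained from", "v-prep", "rocks")

print(kb.has_edge("black salt", "rocks", "v-prep"))  # verb-preposition edge?
print(kb.has_edge("black salt", "rocks", "verb"))    # plain verb edge?
```

Keying the store on the pair of noun phrases makes the two questions raised above (verb edge or verb-preposition edge between two chunks) a direct lookup.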
In recent work on reading comprehension, most systems are based on deep learning approaches [2][3][4][5][6], which need a huge amount of input data. The proposed approach targets small textual datasets that are not sufficient for deep learning. Another contribution of this work is a framework that can be generalized across domains and genres. Science textbooks and stories are completely different genres, so there is a need for a common framework that can be applied to machine reading comprehension across genres with small datasets.
Recent state-of-the-art BERT models help in locating the correct answer based on context, but these models do not perform well on certain compound word and phrase matching tasks. We observed that compound nouns containing cardinal values are not treated as equivalent by the SBERT model: sentences with noun phrase chunks such as "12-18 years" and "12 to 18 years" are not matched correctly. In contrast, shallow parsing with the dependency graph identifies compound nouns with cardinal values as meaningful noun phrase chunks. Grammatical features of the text provide clues for identifying extractive features in the passage. These features are further augmented with the embedding feature provided by the SBERT model. Many researchers have emphasized the use of popular word embedding models such as word2vec, GloVe, and FastText for text-level similarity. In recent work on sentence embeddings, BERT-based transformer models have proved useful for identifying sentence-level semantic similarity in low-resource datasets. In this work, we consider the extractive features along with the Sentence-BERT embedding score to identify the answer among the given options. We use the 'all-MiniLM-L6-v2' transformer model from the Hugging Face open source library [7] to obtain sentence embeddings. The approach thus combines extractive features based on lexical semantics with semantic features based on embeddings.
The proposed methodology is observed to be applicable to both domains, science textbooks and stories. Some problems identified with the methodology are listed in the evaluation analysis and can be taken up as future research. The methodology can also be visualised to demonstrate the stepwise procedure followed in reading comprehension.
This paper is organized as follows. Section 2 describes the various features applicable to language understanding applications. Section 3 discusses the proposed methodology based on knowledge base generation and sentence embeddings. Section 4 presents the experimental findings and the error analysis. Finally, we summarize the challenges in machine reading applications and provide future directions.

RELATED WORK
Machine learning approaches are mainly used to construct the feature space required in comprehension-based systems [8]. The work in [9] describes task-specific retrievers that fetch relevant contexts at an appropriate level of semantic granularity. The survey of complex QA systems in [10] describes different language model architectures, strategies, and challenges in terms of task complexity and evaluation. The major features required for token-level and sentence-level tasks are described in the following sections.

Lexical semantic feature identification
In reading comprehension, the major task is to locate the relevant parts of the passage by identifying entities, relations, lexical properties [11,12], and semantic properties [13][14][15][16][17][18] of the given text. The answer can be extracted directly from the paragraph text or from an intermediate form. In the latter case, the sentences of the paragraph are represented in a structured format such as a database table, a semantic graph, or an annotated form, and questions are answered based on this intermediate representation [18]. The most common way of dealing with MRC tasks is to train a machine learning model on an annotated database [19]. In hybrid QA [20], multiple views are considered for answering over tables and text; the authors explain question answering based on spans of text and tabular data. In our work, the dependency graph is used to extract the spans of text that correspond to noun phrase chunks.

Significance of word embedding
Pretrained word embeddings are an integral part of modern NLP systems and offer significant improvements over embeddings learned from scratch [1]. A word embedding is a type of word representation in which words with similar meanings have similar representations. Each word is represented by a real-valued vector, often with hundreds of dimensions. Latent semantic analysis and skip-gram are the most commonly used methods for learning word vectors. In recent work [21], FastText is used to identify word similarity related to dialects in Arabic text for opinion analysis. In other work [22,23], embeddings are created using popular algorithms such as word2vec, FastText, and GloVe (Global Vectors for Word Representation).
In recent work, many authors have described models that use BERT embeddings.

Role of BERT embedding
In recent work with BERT embeddings [24], the authors describe textual entailment classification for legal text documents, using a Sentence-BERT model along with civil court metadata. The work in [25] describes the need for pretrained BERT models for retrieval and classification of scientific abstracts. In another work [26], BERT's transfer learning ability is used to enhance decision making in sentiment analysis; the authors also compare popular word embedding techniques such as GloVe, FastText, and word2vec with BERT. A combination of an improved BERT model (iBERT) [27] and dependency trees is used to construct a semantic representation of text in sentiment analysis. In the BERT-based method BERT-ConvE [28], embeddings are used to represent node text attributes in order to complete a knowledge graph.

S-BERT for sentence embedding
Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings [1]. BERT's model architecture is a multilayer bidirectional transformer encoder based on the original implementation of attention [30], as shown in Figure 1 and Figure 2. The bi-encoder produces embeddings for the paragraph sentences as well as for the answer option sentences. These embeddings are produced independently. The SBERT model enhances the BERT model by adding a pooling operation to its output, as shown in Figure 3.
In this architecture, sentence A corresponds to a paragraph sentence and sentence B corresponds to an answer option sentence. U and V represent the sentence embeddings.

Dependency parsing and knowledge generation
In abstract form, the sentences of a paragraph are represented with graph structures such as the sequence graph and the dependency graph [31]. The Stanford CoreNLP parser provides an annotation pipeline in which sentences are represented with different grammatical relationships. Shallow parsing with an enhanced dependency parse provides semantic information associated with the text, such as noun phrase chunks, verbs, prepositional phrases, and clauses. The grammatical features extracted from the dependency parser provide the semantic information needed to locate relevant text. This semantic information can be represented in a structured form such as a knowledge graph or a predicate-argument structure [32].
The Stanford CoreNLP library provides APIs for natural language processing operations such as parsing, tokenization, lemmatization, part-of-speech tagging, chunking, sentence segmentation, named entity identification, and coreference identification. The NLTK, OpenNLP, and spaCy toolkits are also available for building more advanced text processing services. Figure 4 shows the sequence graph and dependency graph obtained using the CoreNLP parser for the sentence: The black salt is obtained from rocks.

METHODOLOGY
Question answering is the task of identifying the answer to a question based on a supporting text. Most comprehension-based answering systems locate the correct answer in the given paragraph by identifying proximity with the question words [34]. In the proposed comprehension-based QA system, the textual data is given as paragraph sentences and the questions are given with multiple choice options. The answer is the relevant paragraph sentence that matches an answer option sentence. The answer option sentence is formed by combining a multiple choice option with the given question, as described in Section 3.1.
The problem of answer extraction is treated as optimal subgraph identification in the given paragraph text. To obtain the optimal subgraph, three major predicates are taken into consideration: noun phrase chunks, verbs, and prepositions. The detailed methodology is illustrated in Figure 5 and Figure 6.

Construction of answer option sentence
In the given setup, four options are provided for every question.
The multiple choice answer options take the form of a single word, a phrase, or a brief sentence. The answer option sentence is formed by combining the question with the given answer option, using patterns with specific formats [35]. For example, the question "Which mineral makes strong bones?" is represented by the pattern Which +V.<makes>.(+NP).
This can be rewritten as NP +V.<makes>.(+NP), where the word 'which' is replaced with the given option, represented as a noun phrase (NP).
Answer option sentences are formed by crowd workers using the pattern identification and rewriting rules described above. This task is done offline. There are different types of pattern rewriting rules [35]. Answer option sentences are thus the framed answers obtained from the given answer options.
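The rewriting rule for the "Which" pattern can be sketched as follows. In the described setup this step is carried out offline by crowd workers using the rules in [35]; the regular expression and the function name `answer_option_sentence` below are assumptions that cover only the single pattern "Which <noun> <verb phrase>?" and stand in for the full rule set.

```python
import re
from typing import Optional

# Hypothetical rule: "Which <noun> <rest>?"  ->  "<option> <rest>."
WHICH_NP = re.compile(r"^which\s+\w+\s+(?P<rest>.+?)\?*$", re.IGNORECASE)


def answer_option_sentence(question: str, option: str) -> Optional[str]:
    """Replace the 'Which <noun>' part of the question with the option (an NP).

    Returns None when the question does not follow the single supported pattern.
    """
    match = WHICH_NP.match(question.strip())
    if match is None:
        return None
    return f"{option.strip()} {match.group('rest').strip()}."


question = "Which mineral makes strong bones?"
for option in ["Calcium", "Iron", "Sodium", "Zinc"]:
    print(answer_option_sentence(question, option))
```

Other question types (What, Why, How, Who) would need their own rewriting rules, which are not reproduced here.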

Top sentences identification and storage
Only a few sentences of the paragraph match the answer option sentences; that is, the answer to the question lies in a specific cluster of sentences. Initially, the top paragraph sentences are identified using cosine similarity between the answer option sentence embeddings and the paragraph sentence embeddings. This pairwise sentence scoring task is performed as shown in Algorithm 1. The identified top sentences are stored as lists. This step is applied to reduce the search space.
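The selection step can be sketched as below. This is not a reproduction of Algorithm 1: the function names, the cutoff `k`, and the two-dimensional toy vectors are assumptions. In the described system the vectors would be SBERT sentence embeddings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_sentences(option_vec: Sequence[float],
                  sentence_vecs: List[Sequence[float]],
                  k: int = 3) -> List[int]:
    """Return indices of the k paragraph sentences most similar to the option sentence."""
    scored = [(cosine(option_vec, vec), idx) for idx, vec in enumerate(sentence_vecs)]
    # Highest similarity first; ties are broken by position in the paragraph.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for SBERT output.
paragraph_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
option_vec = [1.0, 0.0]
print(top_sentences(option_vec, paragraph_vecs, k=2))
```

Only the sentences returned here are passed on to feature extraction, which is how the search space is reduced.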

Generation of knowledge graph
Answer extraction from paragraph text requires identifying the relationships among the nodes of the graph. It is essential to identify the semantic roles of the major predicates: nouns, verbs, and prepositions.
A knowledge graph is an abstract graph that consists of nodes corresponding to entities and edges corresponding to specific grammatical relationships. Nouns and noun phrases are the entity nodes, which are connected by edges corresponding to grammatical relationships. The depparse pipeline of the Stanford CoreNLP tool is used to obtain the dependency relationships associated with the noun, verb, and prepositional phrases appearing in a sentence. This information is used to generate the knowledge base, which consists of lists of specific patterns. The knowledge graph is thus stored as a knowledge base of specific relationships.

Extraction and pattern generation phase
Initially, every sentence is represented as a sequence of tokens. Every sentence of the paragraph is a connected graph G(V, E) represented as a dependency graph, where the nodes V are tokens that are connected to other tokens through grammatical relationships. G1(V1, E1) is the corresponding graph for every answer option sentence. To generate knowledge from these graphs, specific combinations of patterns are considered. A noun phrase (NP) with premodifiers consists of an article, adjectives, and a noun.
A noun phrase is a meaningful grouping of tokens. With this combination, which uses noun phrases alone, the NP of the answer option sentence is matched with the NP of the paragraph sentence.
Combination 2: G(Si) - shallow parse, noun phrase, verb; G1(Ai) - shallow parse, noun phrase, verb. In Combination 2, it is checked whether the noun phrase appears along with a verb. A verb is a specific relationship that connects the subject noun and the object noun. Specific patterns are created from these combinations and are termed annotated patterns, e.g. verb-preposition patterns (v-prep denotes a verb followed by a preposition).
The created patterns are stored in structured form. The features considered in this setup are nouns, noun phrases, prepositions, verbs, adverbs, verb-preposition pairs, noun-preposition pairs, and the subject-object edge corresponding to a verb.
Algorithms 2 and 3 describe extractive feature identification and storage in the form of lists. The top sentences of the paragraph are coreference-resolved sentences [36].
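Feature identification and list storage can be sketched as follows. This is a simplification rather than Algorithms 2 and 3: it assumes the sentence is already tokenized and tagged with Penn Treebank POS tags (written by hand here), and it detects v-prep and n-prep pairs by token adjacency instead of by dependency relations. The function name `extract_features` and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Tuple


def extract_features(tagged: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect nouns, verbs, prepositions and adjacent v-prep / n-prep pairs."""
    feats: Dict[str, List[str]] = {
        "noun": [], "verb": [], "prep": [], "v-prep": [], "n-prep": [],
    }
    for i, (token, tag) in enumerate(tagged):
        word = token.lower()
        if tag.startswith("NN"):
            feats["noun"].append(word)
        elif tag.startswith("VB"):
            feats["verb"].append(word)
        elif tag == "IN":
            feats["prep"].append(word)

        # A verb or noun immediately followed by a preposition forms a pair.
        if i + 1 < len(tagged) and tagged[i + 1][1] == "IN":
            following = tagged[i + 1][0].lower()
            if tag.startswith("VB"):
                feats["v-prep"].append(f"{word} {following}")
            elif tag.startswith("NN"):
                feats["n-prep"].append(f"{word} {following}")
    return feats


# Hand-tagged example sentence: "The black salt is obtained from rocks."
tagged_sentence = [
    ("The", "DT"), ("black", "JJ"), ("salt", "NN"), ("is", "VBZ"),
    ("obtained", "VBN"), ("from", "IN"), ("rocks", "NNS"), (".", "."),
]
print(extract_features(tagged_sentence))
```

The same extraction would be run on each answer option sentence so that the two feature sets can be compared list by list.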

Mapping algorithm
A one-to-one mapping of features is applied between the answer option sentences and the paragraph sentences. A score is calculated from the features that match between an answer option sentence and a paragraph sentence. The answer option sentence with the maximum score against a paragraph sentence is considered the correct answer option.
If two or more answer option sentences have the same score, the embedding scores of those answer option sentences are considered. Embeddings of the paragraph sentences are obtained, and the maximum cosine similarity between the two sets of embeddings is computed. The answer option sentence with the maximum cosine similarity score is considered the correct answer option.
In this setup, the extractive feature score is considered first. The cosine similarity of the embeddings is considered only if more than one answer option sentence has the same extractive feature score.
The calculation of the score based on extractive features and the cosine similarity of embeddings is shown in Algorithm 4.
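The selection logic described in this section can be sketched as follows, under stated assumptions. It does not reproduce Algorithm 4: the function names are hypothetical, each matched feature is assumed to contribute one point, and the per-option cosine scores are supplied as plain numbers.

```python
from typing import Dict, List


def feature_score(option_feats: Dict[str, List[str]],
                  sentence_feats: Dict[str, List[str]]) -> int:
    """Count option features that map one-to-one onto paragraph sentence features."""
    score = 0
    for kind, values in option_feats.items():
        remaining = list(sentence_feats.get(kind, []))
        for value in values:
            if value in remaining:
                remaining.remove(value)  # each paragraph feature is used at most once
                score += 1
    return score


def pick_answer(feature_scores: List[int], cosine_scores: List[float]) -> int:
    """Choose by feature score; fall back to cosine similarity only on ties."""
    best = max(feature_scores)
    tied = [i for i, s in enumerate(feature_scores) if s == best]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda i: cosine_scores[i])


option = {"noun": ["salt", "rocks"], "v-prep": ["obtained from"]}
sentence = {"noun": ["salt", "rocks"], "verb": ["obtained"], "v-prep": ["obtained from"]}
print(feature_score(option, sentence))

# Options 1 and 2 tie on features, so the cosine score decides between them.
print(pick_answer([2, 3, 3, 1], [0.9, 0.4, 0.6, 0.2]))
```

Note that option 0 has the highest cosine score in the last call but is not selected, because the embedding score is consulted only among the tied options.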

Implementation of embedding in score calculation
Two different approaches are used to calculate the score for identifying the answer sentence. In the first approach, extractive features are considered along with the cosine similarity of embeddings, as shown in Algorithm 4. The option with the highest score is taken as the answer.
In the second approach, the embedding score is obtained with Sentence-BERT without considering extractive features. Here, the cosine similarity of the embeddings is the only feature used for answer identification. This is implemented by the embedding() function.
The combined score is calculated by adding the score from approach 1 to the score from approach 2. The answer option sentence with the highest combined score is taken as the answer, and the corresponding option is the predicted answer.

Combined score calculation using normalisation
The score obtained with extractive features is an integer, while the score obtained with the cosine similarity of embeddings lies in the range 0 to 1. Euclidean normalization is applied to the extractive feature score to obtain normalized_score, which lies in the range 0 to 1.
The combined score is the sum of normalized_score and cosine_simscore. The predicted answer is the option with the maximum combined_score.
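A worked sketch of the combination follows. The text does not spell out over which values the Euclidean norm is taken; this example assumes it is the vector of extractive feature scores of the four options for one question. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def euclidean_normalize(scores: List[float]) -> List[float]:
    """Divide each score by the L2 norm of the score vector (assumed scheme)."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:
        return [0.0 for _ in scores]
    return [s / norm for s in scores]


def predict(feature_scores: List[float],
            cosine_scores: List[float]) -> Tuple[int, List[float]]:
    """Return the index of the best option and all combined scores."""
    normalized = euclidean_normalize(feature_scores)
    combined = [n + c for n, c in zip(normalized, cosine_scores)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


# Feature scores 3, 4, 0, 0 have L2 norm 5, so they normalize to 0.6, 0.8, 0, 0.
best, combined = predict([3, 4, 0, 0], [0.9, 0.5, 0.1, 0.2])
print(best, combined)
```

In this toy case option 1 leads on extractive features alone, but option 0 wins once the cosine score is added, which illustrates how the two signals can disagree.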

Data set under consideration
The first dataset comprises paragraphs sourced from elementary and middle school science textbooks, supplemented with multiple-choice questions drawn from competitive examinations. It also incorporates science textbook questions from the MultiRC [33] dataset.
The second dataset is MCTest [37], a freely available dataset of stories for reading comprehension. It consists of fictional stories at elementary level created by crowd workers. Details of the datasets are given in Table 1.

Experimental results
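Accuracy is used as the evaluation criterion, where each question has one correct answer among the four provided options. Accuracy is defined as:

\[ \text{Accuracy} = \frac{\text{number of correctly answered questions}}{\text{total number of questions}} \times 100\% \]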

Type of questions
In the science textbook dataset, the majority of the questions fall into categories such as What, Which, Why, and How. In the stories dataset, the majority fall into categories such as What, Who, Why, and How. Table 2 and Table 3 show statistics on the types of questions and the number of correct answers identified with the two approaches. The same methodology is observed to be applicable to two distinct domains. On the science textbook dataset, the accuracy achieved using approach 1 is 60.3%, while it is 55% on the stories dataset. With approach 2, the accuracy is 62.9% and 56% for the science dataset and the stories dataset respectively. With the combined approach, overall accuracy increases by approximately 6% and 2.5% for the science dataset and the stories dataset respectively. With Sentence-BERT, there is a slight decrease in accuracy on non-factoid questions such as why and how.
Note: Combined denotes Approach 1 followed by Approach 2.
The accuracy of answering depends on word overlap and phrasal similarity between the answer option sentence and the probable paragraph sentences. Certain shortcomings of the methodology are identified below.
(1) Sentence embedding: Sentence embedding is used for identifying equivalent noun phrases and equivalent verbs, but it has certain limitations. More work is needed to increase accuracy at the contextual level. Sentences with comma-separated nouns appearing as a list need specific attention. The different grammatical constituents of a sentence need to be explored to generate knowledge.
(2) Referencing: Referencing is another problem when identifying connected sentences (clusters of sentences) or connected noun phrases appearing in the same sentence or in subsequent sentences.
Accuracy can be increased by identifying the correct reference terms and resolving those references. In this setup, the coref pipeline of the Stanford parser is used for coreference resolution.
Both datasets have factoid and non-factoid questions. Whenever the question or the paragraph sentence includes negation, the system does not predict the correct answer. Similarly, some questions are based on a sequence of events or processes, and these are not predicted correctly by our system. The present setup has no provision for handling these cases. Another challenge for the system is questions based on common sense knowledge and implicit reasoning. These can be addressed in future work.

Performance evaluation
ALBERT (A Lite BERT) is a pretrained model whose architecture is widely used for fine-tuning in the question answering domain. An ALBERT configuration similar to BERT-large can be trained about 1.7 times faster [38]. Evaluation is performed by fine-tuning a pretrained ALBERT [38] model, which needs data in the SQuAD dataset format. SQuAD is a popular format for training and evaluating language models on question answering tasks. It includes the passage text accompanied by a question and the corresponding answer. We converted our datasets into the SQuAD format and then fine-tuned the pretrained ALBERT model on a subset of our datasets.
For an exact match (EM), the answer score is taken as 1, while for a partially correct match the answer score is taken as 0.5. When there is no match, the answer score is zero. A performance comparison for both datasets is given in Table 5.
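The 1 / 0.5 / 0 scoring scheme can be sketched as below. The text does not define what counts as a partially correct match; this sketch assumes that any shared token between the predicted and gold answers, after lowercasing and removing punctuation, counts as partial. The function names are hypothetical.

```python
import string
from typing import List, Tuple

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def _tokens(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def answer_score(predicted: str, gold: str) -> float:
    """1.0 for an exact match, 0.5 for token overlap (assumed), 0.0 otherwise."""
    pred_tokens, gold_tokens = _tokens(predicted), _tokens(gold)
    if pred_tokens and pred_tokens == gold_tokens:
        return 1.0
    if set(pred_tokens) & set(gold_tokens):
        return 0.5
    return 0.0


def mean_score(pairs: List[Tuple[str, str]]) -> float:
    """Average answer score over (predicted, gold) pairs; 0.0 for an empty list."""
    if not pairs:
        return 0.0
    return sum(answer_score(p, g) for p, g in pairs) / len(pairs)


print(answer_score("Calcium", "calcium."))
print(answer_score("calcium and iron", "calcium"))
print(answer_score("iron", "calcium"))
```

A stricter definition of partial match, for example a minimum overlap ratio, would change only the middle branch of `answer_score`.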

CONCLUSIONS
We have proposed a generalized methodology for answer identification with small datasets. The methodology combines extractive feature generation with sentence embeddings. Extractive features are obtained with the help of the dependency graph, which captures inherent grammatical relationships. Lexical semantic features provide clues for word-level and phrase-level similarity in information retrieval systems. The SBERT model is used to identify sentence-level textual similarity. The combined approach, using pretrained language tools and SBERT sentence embeddings, is found to be useful for answer identification in small datasets. The methodology describes the stepwise procedure followed in reading comprehension and can be visualised as a learning tool to demonstrate the task of reading comprehension at elementary and middle school level.
The methodology can be further extended by enhancing contextual features related to various grammatical constituents, reference identification, and negation handling.
Combination 3: G(Si) - shallow parse, noun phrase, verb, preposition; G1(Ai) - shallow parse, noun phrase, verb, preposition. In Combination 3, it is checked whether the noun phrase appears with a verb and a preposition. Here, Si is a paragraph sentence and Ai is an answer option sentence. In all three combinations, the noun is part of a noun phrase. The score function over noun phrases, verbs, and prepositions is represented as f(x)=f(x1, x2, x3), where x1, x2, x3 belong to Si and Ai: x1 → noun phrases, x2 → verbs, and x3 → prepositions. f(x) is the score function that considers a one-to-one mapping between the answer option sentence (Ai) and the paragraph sentences (Si).

Table 2. Dataset 1: Question types and count of predicted correct answers

Table 3. Dataset 2: Question types and count of predicted correct answers

Table 4. Datasets and accuracy

Accuracy results obtained by applying the proposed methodology to datasets of two different genres are listed in Table 4. The accuracy of correctly predicted answers is shown in Figures 7 to 10.

Table 5. Performance evaluation with pretrained Q-A model