Question Answering Systems for Covid-19

The COVID-19 pandemic has affected the entire world. This situation motivates researchers to resolve, in an efficient manner, the queries raised by people around the world. However, the scarcity of resources available for gaining information and knowledge about COVID-19 creates a need to evaluate the existing Question Answering (QA) systems on COVID-19. In this paper, we compare the various QA systems available to answer the questions that people such as doctors and medical researchers raise about the coronavirus. QA systems process queries submitted in natural language to find the most relevant answer among all candidate answers for COVID-19 related questions. These systems apply text mining and information retrieval to the COVID-19 literature. This paper surveys the QA systems available for COVID-19: CovidQA, the CAiRE (Center for Artificial Intelligence Research)-COVID system, the CO-Search semantic search engine, COVIDASK, and RECORD (Research Engine for COVID Open Research Dataset). These QA systems are also compared in terms of the significant parameters on which their efficiency relies: Precision at rank 1 (P@1), Recall at rank 3 (R@3), Mean Reciprocal Rank (MRR), F1-score, Exact Match (EM), Mean Average Precision, and the score metric.


Introduction
Since the detection of SARS-CoV-2 [1], or the coronavirus, towards the end of December 2019, the lives of people all over the world have been drastically affected [2]. Researchers, doctors, and health practitioners must keep up-to-date information about this coronavirus in order to save the lives of infected people. For this, Question Answering (QA) systems have been a valuable means of finding accurate answers to questions during the COVID-19 pandemic. The effectiveness of the various question answering techniques lies in finding the most accurate answer out of multiple relevant answers. As users try to find the information available online via search engines such as Microsoft Bing, Yahoo, and Google, the need for automatic question answering in the COVID-19 domain becomes more urgent [3]. Getting exact information or an answer to a particular question is a challenging task, as a large amount of information about COVID-19 is available on Google and other search engines. To solve these issues, researchers use tools from Natural Language Processing, Machine Learning, and Artificial Intelligence. Using these tools, QA systems try to get accurate answers for every question instead of merely retrieving relevant text documents [4]. In this paper we analyze various natural-language-processing-based question answering systems for COVID-19: CovidQA, the CAiRE (Center for Artificial Intelligence Research)-COVID system, the CO-Search semantic search engine, COVIDASK, and RECORD (Research Engine for COVID Open Research Dataset).
These question answering systems are tested using the various COVID-19 datasets available. Their effectiveness is also compared using metrics such as F1-score, Precision, Recall, Mean Reciprocal Rank, and Exact Match (EM).

Existing Question Answering Techniques for Covid-19
Raphael Tang et al. (2020) [4] presented CovidQA, a question answering dataset particularly designed for questions related to COVID-19. CovidQA consists of 124 question-document pairs. Using CovidQA, a number of transformer models were evaluated for transfer-based and unsupervised question answering. Transformer-based models turn out to be more effective when questions are posed as natural language queries rather than keyword-based queries. The input to CovidQA consists of (i) the question, posed as a natural language query, and (ii) the complete text of the ground-truth article. Each sentence is then scored to obtain the most accurate answer, and a pattern-matching approach was used to get an exact answer. The effectiveness of the answers is evaluated in terms of precision at rank one (P@1), mean reciprocal rank (MRR), and recall at rank three (R@3). Various models were compared: BM25, and unsupervised neural techniques such as BERT (Bidirectional Encoder Representations from Transformers) [6], SciBERT (BERT pre-trained on scientific text) [7], and BioBERT (BERT for biomedical text mining) [8]. BERT, BioBERT, and T5 (Text-to-Text Transfer Transformer) [9] were all fine-tuned on MS MARCO (Machine Reading Comprehension dataset) [10], and BioBERT was also fine-tuned on SQuAD (Stanford Question Answering Dataset) [11]. Among all of these, when the query is in natural language form, T5 achieves the highest overall effectiveness, with P@1 equal to 0.282 [4], R@3 equal to 0.404 [4], and MRR equal to 0.415 [4]. When the query is keyword-based, BERT fine-tuned on MS MARCO performed best in terms of P@1, with a value of 0.234 [4], while T5 performed best in terms of R@3 and MRR, with values of 0.376 [4] and 0.360 [4] respectively.
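The three ranking metrics used for CovidQA can be illustrated with a short sketch. The queries, sentence ids, and gold labels below are invented toy data, not values from [4].

```python
# Toy evaluation of ranked answer-sentence lists, illustrating the metrics
# used for CovidQA: Precision at rank 1 (P@1), Recall at rank 3 (R@3), and
# Mean Reciprocal Rank (MRR). The runs below are hypothetical examples.

def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked item is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0

def recall_at_3(ranked, relevant):
    """Fraction of the relevant items that appear in the top 3 results."""
    hits = sum(1 for doc in ranked[:3] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant item (0.0 if none is retrieved)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

# Two hypothetical queries: ranked sentence ids vs. gold relevant ids.
runs = [
    (["s3", "s1", "s7", "s2"], {"s1", "s2"}),
    (["s5", "s9", "s4"],       {"s4"}),
]
p1  = sum(precision_at_1(r, g) for r, g in runs) / len(runs)
r3  = sum(recall_at_3(r, g)    for r, g in runs) / len(runs)
mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
print(p1, r3, mrr)
```

Each metric is averaged over all queries, which is how the per-system figures quoted above are obtained.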
Dan Su et al. (2020) [12] presented the CAiRE (Center for Artificial Intelligence Research)-COVID system, which consists of three major components: (1) a Document Retriever, (2) a Relevant Snippet Selector, and (3) a query-focused Multi-Document Summarizer. The Document Retriever itself consists of two sub-components, Query Paraphrasing and Search Engine. The architecture of this system is shown in 'Figure 1'. First, to retrieve the document most likely to contain the answer to the user query, the Document Retriever is used. Its Query Paraphrasing sub-component rephrases the user query into a simpler query while keeping its meaning the same. The Search Engine sub-component then fetches the candidate documents for the query. The Relevant Snippet Selector is used to get the most accurate answer from the candidate answers. A re-ranking score is calculated as the sum of an Answer Confidence Score and a Keyword-based Score. The Answer Confidence Score is the prediction probability of the answer from the QA models. The Keyword-based Score is calculated by word matching between the query and the retrieved paragraphs, where the matching is done using POS (Part-of-Speech) tagging. Finally, the query-focused Multi-Document Summarizer is used to extract abstractive summaries related to COVID-19 queries.
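The re-ranking step described above can be sketched as follows. This is a simplified illustration, not the CAiRE-COVID implementation: the confidence values are invented stand-ins for QA-model probabilities, and a small stop-word list replaces the POS-tagging-based keyword selection of the real system.

```python
# Minimal sketch of CAiRE-COVID-style re-ranking: the final score is the sum
# of an answer-confidence score (here an invented model probability) and a
# keyword-based score from word matching between query and paragraph. The
# real system selects keywords via part-of-speech tagging; as a stand-in we
# match all lowercased words outside a small stop-word list.

STOP = {"the", "is", "of", "a", "an", "in", "for", "how", "does", "what"}

def keyword_score(query, paragraph):
    """Fraction of the query's content words that occur in the paragraph."""
    keywords = [w for w in query.lower().split() if w not in STOP]
    para_words = set(paragraph.lower().split())
    if not keywords:
        return 0.0
    return sum(w in para_words for w in keywords) / len(keywords)

def rerank(query, candidates):
    """candidates: list of (paragraph, answer_confidence) pairs. Returns the
    list sorted by re-ranking score = confidence + keyword score."""
    scored = [(conf + keyword_score(query, para), para)
              for para, conf in candidates]
    return sorted(scored, reverse=True)

query = "how does the coronavirus spread"
candidates = [
    ("the coronavirus can spread through respiratory droplets", 0.60),
    ("vaccine trials started in several countries", 0.80),
]
best_score, best_para = rerank(query, candidates)[0]
print(best_para)
```

Note how the keyword score lets a paragraph with lower model confidence win when it matches the query terms better, which is exactly the point of the summation.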

Figure 1. Architecture of CAiRE-COVID System
Hillary Ngai et al. (2021) [13] assessed question-answering (QA) models on two labelled question-answer datasets, CovidQA and CovidGQA, using three transformer-based question-answering systems: BERT [6], ALBERT [14], and T5. The systems were pre-trained on various QA datasets such as the Stanford Question Answering Dataset (SQuAD) v1.1 [11], Stanford Natural Language Inference (SNLI) [15], Multi-Genre Natural Language Inference (MultiNLI) [16], Semantic Textual Similarity (STS) [17], and the Question-Answering Biomedical Dataset (BioASQ) [18], and were evaluated on the two labelled datasets. CovidGQA is a manually created COVID-19 dataset encompassing 198 general question-text-answer triplets related to COVID-19; the question text was taken from medical websites, and medical subject-matter experts (SMEs) provided the answers. CovidQA is a COVID-19 question-answering dataset developed manually from knowledge collected from the Kaggle CORD-19 dataset [19]. To extract each article's relevant text for answering each question, the CovidQA dataset is combined with the CORD-19 dataset; the final evaluation dataset contains 69 question-text-answer triplets. To evaluate the QA systems, the macro-averaged F1 score and the Exact Match (EM) of the answer extraction methods on each dataset are used. BERT-large-uncased and ALBERT-base-uncased use whole-word masking and are pre-trained on SQuAD v1.1, while T5-large was pre-trained on the Colossal Clean Crawled Corpus (C4). The comparison shows that BERT-large achieves the highest macro-averaged F1 score on both datasets, whereas ALBERT-base unexpectedly outperforms BERT-large on EM, achieving the highest EM of the three models [13].
Although T5-large has the largest number of parameters (770M), both BERT-large and ALBERT-base outperform T5-large on all metrics on both datasets, because T5-large was not pre-trained on a QA task. Based on these preliminary results from BERT, ALBERT, and T5 on the two QA datasets, and considering one significant limitation of transformers, namely that they require a lot of labelled QA pairs to reach acceptable performance, Hillary Ngai et al. (2021) [13] proposed the hybrid QA systems SBERT-BERT-QA and LDA-ALBERT-QA, which combine few-shot learning with a transformer.
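The two extraction metrics used in [13] can be sketched in a few lines. This is a simplified stand-in for the SQuAD-style evaluation script: the normalization here is only lowercasing and whitespace tokenization, and the answer pairs are invented examples.

```python
# Toy versions of Exact Match (EM) and token-level F1 between a predicted
# answer span and the gold answer, averaged over a (hypothetical) dataset.

def exact_match(pred, gold):
    """1.0 only when the normalized strings are identical."""
    return float(pred.lower().strip() == gold.lower().strip())

def f1_score(pred, gold):
    """Harmonic mean of token precision and recall between the two spans."""
    pred_toks = pred.lower().split()
    gold_toks = gold.lower().split()
    common, remaining = 0, list(gold_toks)
    for t in pred_toks:
        if t in remaining:
            remaining.remove(t)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

pairs = [("droplet transmission", "droplet transmission"),
         ("through respiratory droplets", "respiratory droplets")]
em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
f1 = sum(f1_score(p, g) for p, g in pairs) / len(pairs)
print(em, f1)
```

The second pair shows why EM is stricter than F1: an answer that merely extends the gold span scores 0 on EM but still gets partial credit on F1, which is one reason the EM and F1 rankings of the three models can disagree.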
Esteva et al. [20] presented CO-Search, a retriever-ranker semantic search engine. Retrieval of documents is done using two keyword-based models and one semantic model. Documents are indexed by embedding their paragraphs with a pre-trained SBERT (Siamese-BERT) model [21]; these embeddings are then combined with TF-IDF (Term Frequency-Inverse Document Frequency) and BM25. A linear combination of the SBERT paragraph-level scores and the TF-IDF document-level scores is used to score each document, and reciprocal rank fusion then combines this ranking with the one obtained from BM25. The final ranking is produced from the scores of the retrieved documents together with the outputs of the question answering module and the summarizer. Several key metrics are used in the evaluation, such as Precision, Bpref (Binary Preference), MAP (Mean Average Precision), and nDCG (Normalized Discounted Cumulative Gain). After evaluation, the CO-Search semantic search engine ranked best among automated systems.
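Reciprocal rank fusion, the step that merges the semantic and BM25 rankings, can be sketched as follows. The document ids and the two input rankings are invented; k=60 is the constant commonly used with this method, not a value taken from [20].

```python
# A small sketch of reciprocal rank fusion (RRF), used by CO-Search to
# combine the semantic (SBERT + TF-IDF) ranking with the BM25 ranking.

def rrf(rankings, k=60):
    """rankings: list of ranked lists of doc ids. Returns doc ids sorted by
    the summed reciprocal-rank score sum over runs of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_run = ["d2", "d1", "d3"]   # hypothetical SBERT + TF-IDF ranking
bm25_run     = ["d1", "d4", "d2"]   # hypothetical BM25 ranking
fused = rrf([semantic_run, bm25_run])
print(fused)
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the two retrievers, which is a common reason to prefer it over a weighted sum of heterogeneous scores.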
Lee et al. [22] presented the COVIDASK question answering system, which incorporates biomedical text mining into an existing question answering technique and provides answers to questions in real time. Questions can take two forms: interrogative (wh-type) questions such as "where did covid new strain comes from?" or keyword-based queries such as "covid new strain come". Information retrieval is also used for evaluation. Several factors are taken into account. Recency means that, for each question, up-to-date information or documents are used to extract the answer. Another factor is the significance of the resources, i.e., answers should be extracted from research papers published at reputable venues. An evidence document must be provided with each extracted answer, and latency, the delay between inputting a question and receiving an answer, must be minimal. All the phrases contained in CORD-19 [23] are pre-indexed and used to build the DENSPI (Dense-Sparse Phrase Index) [24] model; the biomedical named entities described in PubMed are used to build BEST (Biomedical Entity Search Tool) [25]. For a given query, COVIDASK returns answers from both DENSPI and BEST. First, each phrase vector is supervised with an extractive question answering dataset such as SQuAD [11]; then all candidate answer phrases are encoded into sparse and dense vectors. BEST builds an inverted index and returns entity-level search results; it basically works well with keyword-based queries, for which it is preferable to DENSPI. COVIDASK uses the CORD-19 corpus related to coronavirus queries for the phrase indexing phase of DENSPI, while BEST utilizes articles on PubMed. Two extractive datasets are used for training DENSPI: SQuAD [11] and Natural Questions [28]. Performance on interrogative questions is superior when DENSPI + SPARC is trained on SQuAD.
DENSPI trained on Natural Questions performs well on keyword-based questions. The results on TREC-COVID show that COVIDASK is not very effective compared with other systems fully dedicated to information retrieval rather than question answering, although COVIDASK efficiently handles natural language questions in both keyword and interrogative forms. Lu et al. [29] proposed a pipeline that describes what the CORD-19 articles reveal by automatically returning answers to COVID-19 related questions and then aggregating those answers. The pipeline consists of several modules. (i) The Context Retrieval Module finds only the relevant passages from the entire set of documents. It uses BM25, an enhanced version of TF-IDF, to rank the passages by their similarity to the question, and the passages with the highest similarity are selected. (ii) The Question-Answering Model: given a question and a passage chosen by the context retrieval module, BERT (Bidirectional Encoder Representations from Transformers) [6] is used to predict the location and length of the answer in each passage, as shown in 'Figure 2'.

Figure 2. Demonstration of Pipeline
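The BM25 ranking performed by the context retrieval module can be sketched compactly. The passages below are invented toy data, and k1 = 1.5, b = 0.75 are the usual default parameters rather than values reported in [29].

```python
import math

# Compact sketch of BM25 scoring, the retrieval function used by the context
# retrieval module to rank passages against a question.

def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Return one BM25 score per passage for the given query."""
    docs = [p.lower().split() for p in passages]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for d in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for doc in docs if term in doc)   # document frequency
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            tf = d.count(term)                           # term frequency
            # Length normalization damps scores for long passages.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

passages = [
    "masks reduce transmission of the virus",
    "the virus spreads mainly through droplets",
    "economic impact of the pandemic",
]
scores = bm25_scores("virus transmission", passages)
best = passages[max(range(len(passages)), key=scores.__getitem__)]
print(best)
```

The rarer term "transmission" receives a higher IDF weight than "virus", so the passage containing both terms dominates the ranking, which is the behavior that makes BM25 an improvement over plain TF-IDF for passage selection.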
(iii) The Opinion Aggregator aggregates and summarizes the answers into opinions. It uses BTM (Biterm Topic Model) to reveal the topics that act as a basis for the answers: a topical feature vector is calculated for each answer, and the feature vectors are then clustered using the k-means algorithm. Muffo et al. [30] presented RECORD (Research Engine for COVID Open Research Dataset), a tool that provides answers to questions based on COVID-19. It uses the CORD-19 dataset, focused only on the articles about COVID-19 and SARS-CoV-2. Three steps are involved in this system. (i) The Preprocessing Step splits the entire text into chunks related to the query posed by the user; the text is broken down into paragraphs in such a way that the semantic relationship between the query and the extracted paragraphs is not broken. (ii) The Embedding Step embeds all the chunks in the body text: the query and the related chunks are both embedded using the Sentence-BERT model [21], and the cosine similarity between the query embedding and each chunk embedding is calculated to extract the text documents most semantically related to the query. (iii) The Question Answering Step extracts the answer using a question answering model based on the BERT-LARGE architecture. This step provides the best answer among all the chunks along with a score: a score is calculated for each chunk, and the chunk with the highest score is considered the best answer, the score acting as a discriminating factor among the related chunks for a given query. RECORD also provides the title, authors, publishing journal, paper citations, Scimago Journal Score, etc. A summary of the different question answering systems on COVID-19 is given in Table 1.
As reported in [30], under the score metric measuring precision and consistency with the input question, RECORD performs well in providing consistent and precise answers.
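RECORD's embedding step, ranking chunks by cosine similarity to the query, can be sketched as follows. The 3-dimensional vectors here are invented stand-ins for Sentence-BERT embeddings (which are hundreds of dimensions in practice), and the chunk names are hypothetical.

```python
import math

# Minimal sketch of RECORD's embedding step: the query and each text chunk
# are embedded (by Sentence-BERT in the real system; here by tiny hand-made
# vectors) and ranked by cosine similarity, with the top chunk passed on to
# the question answering step.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-dimensional embeddings standing in for SBERT output.
query_vec = [0.9, 0.1, 0.0]
chunks = {
    "chunk_a": [0.8, 0.2, 0.1],   # close to the query in embedding space
    "chunk_b": [0.0, 0.3, 0.9],   # unrelated topic
}
ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]),
                reverse=True)
print(ranked[0])
```

Cosine similarity compares directions rather than magnitudes, which is why it is the standard choice for comparing sentence embeddings of texts with very different lengths.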

Conclusion
In the recent pandemic situation, QA systems can help users acquire relevant information effectively. However, most users like to submit their queries in their natural language, so QA systems are designed and modeled to process queries using natural language processing. The performance of these QA systems can be analyzed using various metrics, such as Precision at rank 1 (P@1), Recall at rank 3 (R@3), Mean Reciprocal Rank (MRR), F1-score, Exact Match (EM), Mean Average Precision, and the score metric, as discussed. On the basis of these metrics, we observe that some QA systems, such as those based on BERT, are efficient at processing queries by keyword matching, while others, such as those based on T5, rely on semantic analysis. At the same time, SBERT in combination with TF-IDF and BM25 shows remarkable efficiency in processing queries using both keyword matching and semantic analysis, which implies that QA systems should process queries using both. Conclusively, we observe that the CO-Search semantic search engine, based on SBERT in combination with TF-IDF and BM25, is the preferred QA system in recent times.