A Stacked BiLSTM Neural Network Based on Coattention Mechanism for Question Answering

Deep learning is a crucial technology in intelligent question answering research. Extensive studies on question answering have adopted deep learning methods. The challenge is that the task requires not only an effective semantic understanding model to generate a textual representation but also consideration of the semantic interaction between questions and answers. In this paper, we propose a stacked Bidirectional Long Short-Term Memory (BiLSTM) neural network based on the coattention mechanism to extract the interaction between questions and answers, combining cosine similarity and Euclidean distance to score the question and answer sentences. Experiments are conducted on the publicly available Text REtrieval Conference (TREC) 8-13 dataset and the Wiki-QA dataset. Experimental results confirm that the proposed model is efficient; in particular, it achieves a mean average precision (MAP) of 0.7613 and a mean reciprocal rank (MRR) of 0.8401 on the TREC dataset.


Introduction
Deep learning forms more abstract high-level feature representations by combining low-level features, thereby discovering distributed feature representations of data. It provides an effective method for natural language processing (NLP) research. In recent years, intelligent question answering has emerged as a prominent research topic in both academia and industry and has been widely used in many influential question answering systems. Answer selection plays a vital role in the question answering task: it encodes a question-answer (QA) pair and feeds it into a model to extract the key information and obtain the corresponding representation [1]. Thus, the main task is to map the question and answer sentences into a joint feature space to generate codependent representations for them. Finally, an algorithm is used to calculate their similarity.
In the past few years, most question answering studies [2][3][4] were based on knowledge bases and FAQs, using machine learning to analyze and retrieve keywords. Unfortunately, both approaches lack semantic analysis of the questions and answers, which results in heavy reliance on manual effort and poor scalability.
With the significant progress of deep learning, deep neural networks are able to effectively map the meanings of the individual words in a sentence to a continuous representation of the entire sentence, and the resulting sentence representation is more complete. Because deep learning reduces the need for manual feature engineering and for adaptation to new tasks, it has become an important research method for various NLP tasks in the last several years, and a large number of researchers take advantage of its end-to-end models for sentence semantic analysis to implement question answering tasks. Feng et al. [5] and Wang and Nyberg [6] used convolutional neural networks (CNN) and Bidirectional Long Short-Term Memory networks, respectively, to capture single-sentence semantics. Nevertheless, both of them ignored the interrelationship between the encoded representations of the question and the answer.
Recently, models based on the attention mechanism have been explored for question answering. Tan et al. [7] and Nie et al. [8] proposed BiLSTM models that incorporate the attention mechanism to construct a better answer representation according to the input question sentence. These models take into account the effect of the question on the encoding of the answer, but they ignore the effect of the answer on the encoded representation of the question, which causes some deviations in the final prediction result. For instance, suppose question 1 is "Michael, what are you eating?", question 2 is "Michael, why are you eating so much?", and the answer is "Yeah, I'm eating a hamburger." The words "what" and "eating" in question 1 and the words "I'm" and "eating a hamburger" in the answer have a certain semantic association, and we can easily infer that the answer corresponds to question 1. This means that each answer has some intrinsic connection with the question and that, to some extent, the question representation is affected by different answers. In addition to analyzing the answers from the questions, we can also infer information about the questions from the answers.
In this paper, we construct a deep learning architecture for question answering, where questions and answers are limited to a single sentence. The core of our architecture consists of two distributed sentence models working in parallel, based on a stacked BiLSTM neural network. We map questions and answers to their corresponding distributed vectors and finally calculate the semantic similarity between them. BiLSTM neural networks have been widely used in recent years to deal with NLP problems [9][10][11]. Zhang and Ma [12] established a new deep learning model based on BiLSTM networks to accomplish the answer selection task and achieved favorable results. Motivated by this work, we use a stacked BiLSTM deep neural network that incorporates the coattention mechanism to semantically understand and model the QA pair, thus allowing the model to capture long-range sentence-level dependencies and generate deeper codependent representations for the QA pair. Additionally, cosine similarity and Euclidean distance are combined into a new metric to measure the semantic similarity and distance between questions and answers. Experiments are conducted on the Text REtrieval Conference 8-13 QA dataset and the Wiki-QA dataset. The comparison shows that our model achieves the best experimental results. The main contributions of this paper are summarized as follows:
(i) A stacked BiLSTM neural network is used to obtain the vector representation of the input sentence, which can effectively capture the semantics of the sentence.
(ii) Our model combines the coattention mechanism and the attention mechanism to encode sentences and capture the interaction and mutual influence within the QA pair.
(iii) Cosine similarity and Euclidean distance are combined to calculate the degree of matching between two vectors. This method takes both the distance and the angle between the vectors into consideration.
The rest of this paper is organized as follows. Section 2 gives a brief review of related work. Section 3 presents the proposed framework and method for question answering. Section 4 gives a detailed analysis and summary of the experimental results. Section 5 draws conclusions and discusses future work.

Related Work
Research in question answering has been greatly boosted by the Text REtrieval Conference series since 1999. Recently, a number of related works [12][13][14][15] have proposed efficient models for question answering. Below, we relate the stacked BiLSTM neural network, the coattention mechanism, and the scoring metric used in our model to other methods in the literature.

Long Short-Term Memory Neural Networks.
Previously, traditional research approaches concentrated on syntactic matching between questions and answers. Punyakanok et al. [3] were the earliest to propose a general question and answer matching model based on dependency trees. Later, both Heilman and Smith [2] and Khan et al. [16] presented probabilistic tree edit algorithms to model sentences. Yao et al. [17] constructed a linear-chain conditional random field on the TREC-QA dataset, which treated answer extraction as a sequence labeling problem over tree edit features. Moreover, Zhou et al. [4] used a lexical model based on word relations to select answer sentences. However, these traditional models rely excessively on external conditions such as manually labeled information, which requires a large amount of additional work.
In recent work on question answering, the mainstream approaches are based on deep learning. Yih et al. [18] and Wang et al. [19] developed a semantic parsing framework with a semantic similarity model using convolutional neural networks. Wang and Nyberg [6] used a stacked BiLSTM network to sequentially read words from the question and answer sentences, which did not require any syntactic parsing or external knowledge resources such as WordNet. However, these models failed to consider the codependent representations of the questions and answers. Thus, we add the attention mechanism to the deep neural network to capture the associations within the QA pair.

Coattention Mechanism.
The attention mechanism is well suited to inferring the mapping relationship between data of different modalities. It can help an encoder-decoder framework properly capture the interrelationships among multiple content models and thus represent them more effectively [1]. Many related works have explored the attention mechanism in question answering. Based on bidirectional recurrent neural networks, Bahdanau et al. [20] added the attention mechanism to the model to encode and decode sentences in machine translation. Zhang et al. [21] examined the inner attention mechanism and the outer attention mechanism in discourse representation for implicit discourse relation recognition. The result showed an improvement of 1.61% in macro-F1. Inspired by the related work of Bahdanau et al. [20] and Fu et al. [22], Tan et al. [7] and Xiang et al. [23] successively proposed attention mechanisms based on bidirectional single-layer LSTMs for question-answer matching, which are able to construct better answer representations according to the input question. Meanwhile, Lu et al. [24] took the lead in presenting a hierarchical coattention model for visual question answering. They used the coattention mechanism to compute a conditional representation of the image given the question and a conditional representation of the question given the image. Inspired by this work, Xiong et al. [10] presented a dynamic coattention network (DCN) to obtain codependent representations of the question and the document, and they used a dynamic pointing decoder to rank potential answers. The experiment achieved a 0.8% EM and 2.1% F1 improvement on the SQuAD dataset. A more refined coattention model was proposed by Zhang and Ma [12]. The authors combined the coattention mechanism with the attention mechanism to encode the representations of questions and answers, and this model exploited the inner relationship between questions and answers to improve the experimental results. Our research also adopts a similar coattention mechanism to extract sentence features.

Scoring Mechanism.
In many previous works such as Liu [25] and He et al. [26], cosine similarity has been proven to be an effective metric for evaluating the similarity between two chord vectors, and it has been widely used in complex queries and matching in recent years. In contrast, Lee et al. [27] used the Euclidean distance as the classification decision function to measure the average distance between a new data point and the support vectors of different categories, and their data showed that it is efficient. Feng et al. [5] proposed two novel metrics, GESD (Geometric mean of Euclidean and Sigmoid Dot product) and AESD (Arithmetic mean of Euclidean and Sigmoid Dot product), in their answer selection task. These two metrics performed best among all the metrics they compared. In the work of Yin et al. [15], cosine similarity and Euclidean distance were used separately to calculate the sentence similarity and to measure the semantic distance between different sentences. The result revealed that using the two measures together is superior to using only the cosine similarity metric. Unlike the previous research, our approach combines the two functions into a single score. Our results show that the method is efficient.

Proposed Question Answering Model
In this section, we describe the proposed question answering model based on deep learning, which builds on the architectures of Tan et al. [1] and Xiong et al. [10]. An overview of the framework is shown in Figure 1.
In Figure 1, we first utilize the pretrained GloVe to construct word embedding layer, and this word embedding provides the vector representation for each question and its candidate answers. Second, the stacked BiLSTM neural network serves as an encoder that extracts hidden features from each input sentence. Corresponding representations can be obtained by the questions based on the coattention mechanism. After entering the question vector into the maximum pooling, the attention mechanism is used to generate an answer embedding according to the question representation. At last, we combine cosine similarity and Euclidean distance to measure the degree of matching between the question vector and the answer vector.

A Stacked BiLSTM Neural Network.
The LSTM network architecture was originally developed by Hochreiter and Schmidhuber [28]. More formally, we are given an input sequence $x = (x_1, x_2, \ldots, x_n)$, where $n$ indicates the length of the input sentence. The core structure of the LSTM is the use of three control gates to control a memory cell activation vector $c$. The first, the forget gate, determines how much of the previous cell state $c_{t-1}$ is retained in the current cell state $c_t$; the second, the input gate, determines the extent to which the network input $x_t$ is saved to the current cell state $c_t$; the third, the output gate, determines how much of the cell state $c_t$ is transmitted to the current output value $h_t$ of the LSTM network. Each of the three gates is a fully connected layer whose input is a vector and whose output values lie in [0, 1]. The basic LSTM cell architecture is shown in Figure 2.

Computational Intelligence and Neuroscience
In the equations for the input, forget, and output gates, $\sigma$ is the logistic sigmoid function, $x_t$ indicates the $t$-th word vector of the sentence, and $h_t$ indicates the hidden state. The $W$ terms and $b$ terms represent, respectively, the weight matrices (e.g., $W_{xf}$ represents the forget gate weight matrix) and the bias vectors (e.g., $b_i$ represents the input gate bias vector) of the three gates.
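The gate computations described above can be written in the standard LSTM form. The following is a sketch in the notation of this section, assuming the common formulation without peephole connections; apart from $W_{xf}$ and $b_i$, which are named in the text, the weight and bias names (for example $W_{hf}$ and $W_{xc}$) are extrapolated from the same naming pattern and are assumptions.

```latex
\begin{aligned}
f_t &= \sigma\left(W_{xf}\, x_t + W_{hf}\, h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_{xi}\, x_t + W_{hi}\, h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_{xo}\, x_t + W_{ho}\, h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_{xc}\, x_t + W_{hc}\, h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here $\odot$ denotes element-wise multiplication, so $f_t$ scales how much of $c_{t-1}$ is kept, $i_t$ scales how much of the new candidate is written, and $o_t$ scales how much of $c_t$ is exposed as $h_t$.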
To overcome the shortcoming that a single LSTM cell can only capture the previous context but cannot use the future context, Schuster and Paliwal [29] invented bidirectional recurrent neural networks (BRNN), which connect two separate hidden layers of opposite directions to the same output. With this structure, the output layer is able to use related information from both the previous and the future context. A BiLSTM processes the input sequence in both the forward and the backward direction. The encoded vector $y_t$ is formed by concatenating the final forward and backward outputs, where $y = (y_1, y_2, \ldots, y_t, \ldots, y_n)$ is the output sequence of the first hidden layer. Some previous works showed that stacking multiple BiLSTM layers in a neural network can further improve the performance of classification or regression [30][31][32].
Moreover, there is theoretical support showing that a deep hierarchical model is more efficient in representing some functions than a shallow one [6,33]. We define a stacked BiLSTM network in which the output $y_t$ of the lower layer becomes the input of the upper layer. The stacked BiLSTM structure is illustrated in Figure 3. We define $Q = (q_1, q_2, \ldots, q_n)$ and $A = (a_1, a_2, \ldots, a_m)$ to represent the question and answer sequences, respectively, where $n$ and $m$ indicate the lengths of the question and the answer, and $q_t$ and $a_t$ indicate the $t$-th words of the question and the answer. We run a stacked BiLSTM over the question and the answer to obtain their hidden state matrices $H^Q$ and $H^A$, where $d$ is the dimension of the hidden state.
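The bidirectional encoding and stacking described above can be summarized as follows. This is a sketch under stated assumptions: the layer index $\ell$, the arrow notation, the superscript "top", and the $2d \times n$ shape convention are illustrative choices and are not taken from the original equations.

```latex
\begin{aligned}
\overrightarrow{h}^{(\ell)}_t &= \overrightarrow{\mathrm{LSTM}}\left(y^{(\ell-1)}_t,\ \overrightarrow{h}^{(\ell)}_{t-1}\right), \qquad
\overleftarrow{h}^{(\ell)}_t = \overleftarrow{\mathrm{LSTM}}\left(y^{(\ell-1)}_t,\ \overleftarrow{h}^{(\ell)}_{t+1}\right), \\
y^{(\ell)}_t &= \left[\overrightarrow{h}^{(\ell)}_t \,;\, \overleftarrow{h}^{(\ell)}_t\right], \qquad y^{(0)}_t = x_t, \\
H^Q &= \left[y^{\mathrm{top}}_1, \ldots, y^{\mathrm{top}}_n\right] \in \mathbb{R}^{2d \times n}
\ \text{for input } Q, \qquad
H^A = \left[y^{\mathrm{top}}_1, \ldots, y^{\mathrm{top}}_m\right] \in \mathbb{R}^{2d \times m}
\ \text{for input } A.
\end{aligned}
```

Each layer reads the concatenated outputs of the layer below it, so upper layers see context from both directions at every position.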

Coattention Mechanism for Question Representation.
Here, we implement a coattention mechanism to encode the question according to the answer sequence, as shown in Figure 4. Motivated by the work of Xiong et al. [10], we try to enforce more question-answer interactions through carefully designed matrix multiplications and concatenations in the coattention mechanism.
We first perform matrix multiplication to calculate the affinity matrix $L$, which contains the affinity scores for all pairs of question and answer words. The softmax function is applied to normalize vector elements, and it is effective for multiclass classification and probability distribution problems. Hence, column-based and row-based softmax functions are used to generate the attention weights for the hidden states of the question and the answer, respectively. To obtain the attention vector of the question in light of each word of the answer, we concatenate the attention weights and the affinity matrix to compute the new context vectors $C^Q$ and $C^A$. Here, $C^Q$ and $C^A$ are the results of the interaction between the question and the answer vectors.
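The coattention steps above (affinity matrix, two softmax normalizations, context vectors) can be sketched in a few lines. The sketch below follows the general pattern of the dynamic coattention network of Xiong et al. [10] as an assumption; the exact multiplications and concatenations of the proposed model may differ. The function names, the row-per-word layout, and the toy inputs are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, rows):
    """Return sum_k weights[k] * rows[k] for equally sized row vectors."""
    width = len(rows[0])
    return [sum(w * r[c] for w, r in zip(weights, rows)) for c in range(width)]


def coattention(h_q, h_a):
    """h_q: n question hidden states, h_a: m answer hidden states (one row per word)."""
    n, m = len(h_q), len(h_a)
    # Affinity score for every (answer word, question word) pair: m x n.
    affinity = [[dot(a, q) for q in h_q] for a in h_a]
    # For each answer word, attention weights over the question words (row-wise softmax).
    attn_over_q = [softmax(row) for row in affinity]
    # For each question word, attention weights over the answer words (column-wise softmax).
    attn_over_a = [softmax([affinity[i][j] for i in range(m)]) for j in range(n)]
    # Question context: an answer summary for each question word (n x d).
    c_q = [weighted_sum(attn_over_a[j], h_a) for j in range(n)]
    # Answer context: attend over [question state ; question context] (m x 2d).
    q_aug = [h_q[j] + c_q[j] for j in range(n)]  # list concatenation
    c_a = [weighted_sum(attn_over_q[i], q_aug) for i in range(m)]
    return affinity, c_q, c_a


# Toy example: 3 question words, 2 answer words, hidden size 2.
h_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_a = [[1.0, 0.0], [0.5, 0.5]]
affinity, c_q, c_a = coattention(h_q, h_a)
print(len(c_q), len(c_q[0]), len(c_a), len(c_a[0]))  # prints: 3 2 2 4
```

In this layout each row of the answer context carries both the attended question states and the question's own view of the answer, which is one way to make the two representations codependent.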

Attentive Attention Mechanism for Answer Representation.
To reduce the information loss of the stacked BiLSTM, a soft attention flow layer can be used to link and integrate information from the question and answer words [1,13]. In the proposed model, the attention mechanism is applied to the output of coattention. Let $C^Q_t$ denote the $t$-th attention context vector of the question; max pooling is applied to convert the input into a fixed-length output vector $O_q$. Then, the softmax weights of all context vectors $(C^A_1, C^A_2, \ldots, C^A_m)$ are learned according to $O_q$ via the attention mechanism, and the weighted context vector $O_a$ of the answer is used as the final representation. Here, $W_{am}$ and $W_{qm}$ represent the attention matrices of $C^A_t$ and $O_q$, respectively, and $w_{ms}$ denotes the attention weight vector. The final representation $O_a$ of the answer is determined by the attention weight $S_{aq}(t)$ for the answer context vector of the $t$-th word. This weight is normalized by the softmax function and depends on $C^A_t$ and $O_q$. Higher values of $S_{aq}(t)$ indicate a higher correlation between $C^A_t$ and the question, so the corresponding context vector receives more attention.
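The answer attention described above can be written out as follows. This is a sketch that assumes the attentive formulation of Tan et al. [7], which uses the same parameter names $W_{am}$, $W_{qm}$, and $w_{ms}$; the intermediate vector $u_{aq}(t)$ is a name introduced here for illustration only.

```latex
\begin{aligned}
O_q &= \max_{1 \le t \le n} C^Q_t && \text{(element-wise max pooling)} \\
u_{aq}(t) &= \tanh\left(W_{am}\, C^A_t + W_{qm}\, O_q\right) \\
S_{aq}(t) &= \frac{\exp\left(w_{ms}^{\top}\, u_{aq}(t)\right)}{\sum_{k=1}^{m} \exp\left(w_{ms}^{\top}\, u_{aq}(k)\right)} \\
O_a &= \sum_{t=1}^{m} S_{aq}(t)\, C^A_t
\end{aligned}
```

Under this form the weights $S_{aq}(t)$ sum to one over the answer positions, so $O_a$ is a convex combination of the answer context vectors.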

Answer Scoring Mechanism and Objective Function.
In this work, we combine cosine similarity and Euclidean distance to evaluate the degree of matching between questions and answers. Cosine similarity reflects the angle between two vectors, and the Euclidean distance represents the distance between two points in Euclidean space. We want both the distance and the angle between the question and answer semantic vectors to be small, so that the similarity score of a matching QA pair is maximized. A schematic diagram of cosine similarity and Euclidean distance is shown in Figure 5.
A vector representation of the question and of the answer is obtained from the hidden layer of the model. The cosine similarity and the Euclidean distance are then calculated, and $\mathrm{Score}(O_q, O_a)$ is the final matching result. The cosine similarity is normalized to the [0, 1] interval. In these formulas, $\cdot$ represents the dot product, $|O_q|$ and $|O_a|$ represent the norms of the corresponding vectors, and $\|O_q - O_a\|_2$ is the Euclidean distance between the two points; the values of equations (9) and (10) are in the range [0, 1]. During training, positive and negative samples are input simultaneously by using the hinge loss function, which we define as the training objective. Here, $M$ is the constant margin, and $a^+$ and $a^-$ denote the positive answer and the negative answer, respectively. $\lambda$ and $\theta$ represent the regularization parameter and the neural network parameters, respectively.
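The scoring and training objective described above can be illustrated with a small sketch. Several details are assumptions rather than the formulas of the proposed model: the cosine similarity is mapped to [0, 1] with 0.5 + 0.5 cos, the Euclidean distance is turned into a similarity with 1 / (1 + distance), the two are combined by a product, and the regularization term weighted by λ is omitted. All function names are hypothetical, and the input vectors are assumed to be nonzero.

```python
import math


def cosine_sim01(u, v):
    """Cosine similarity rescaled from [-1, 1] to [0, 1] (assumed rescaling)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 + 0.5 * dot / (norm_u * norm_v)


def euclid_sim01(u, v):
    """Map the Euclidean distance to (0, 1]; identical vectors give 1 (assumed mapping)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + dist)


def score(o_q, o_a):
    """Combined matching score; the product is an illustrative choice."""
    return cosine_sim01(o_q, o_a) * euclid_sim01(o_q, o_a)


def hinge_loss(o_q, o_a_pos, o_a_neg, margin=0.2):
    """Pairwise hinge loss with margin M; zero once the positive answer wins by M."""
    return max(0.0, margin - score(o_q, o_a_pos) + score(o_q, o_a_neg))


# Toy example: the positive answer equals the question vector, the negative is its opposite.
question = [1.0, 2.0]
positive = [1.0, 2.0]
negative = [-1.0, -2.0]
loss_good = hinge_loss(question, positive, negative)  # ranking already correct
loss_bad = hinge_loss(question, negative, positive)   # ranking reversed
```

With the margin of 0.2 used in the experiments, a triple contributes no loss once the positive answer outscores the negative answer by at least 0.2, so training focuses on the pairs that are still confused.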
In the process of training, we use the backpropagation algorithm to calculate the gradient $\partial L / \partial \theta$ and update the parameters $\theta$ to minimize the objective function [34]. Finally, we keep the parameters that attain the minimum objective value $L_{\min}$.

Experiments
In this section, we introduce the details of the experimental implementation, including the TREC-QA (8-13) dataset and the Wiki-QA dataset, the model evaluation metrics, and the selection of training parameters. We then analyze the experimental results on the different datasets to show that our proposed model has good accuracy and robustness.

Datasets.
In this part, we introduce two public datasets, the TREC-QA (8-13) dataset and the Wiki-QA dataset, including their sources, data characteristics, and numbers of Q&A pairs. The experiments are conducted on the Text REtrieval Conference 8-13 QA dataset (http://nlp.stanford.edu/mengqiu/data/qg-emnlp07-data.tgz), which was created by Wang et al. [35] and further elaborated by Yao et al. [17]; its statistics are shown in Table 1. The Wiki-QA questions are collected and organized from real user data. The candidate answer sentences come from the topmost text paragraph returned by the Wikipedia page. As shown in Table 2, after filtering out the questions without a correct answer, a total of 1242 Wiki-QA questions were obtained, and 293 correct answer sentences matched the questions; the data format of the Wiki-QA corpus is not much different from that of TREC-QA (8-13).
In this paper, all experiments were performed on Python, MATLAB, and their optimization toolboxes on a computer with an Intel Core 2 Duo 2.93 GHz processor and a Windows 7 operating system.

Evaluation Metrics.
Following the previous work of Wang et al. [35] on this task, two evaluation metrics are used: mean average precision (MAP) and mean reciprocal rank (MRR). MAP is the mean of the average precision scores over all queries. It reflects the performance of the retrieval system on all queries. The higher the relevant documents are ranked by the system, the larger the corresponding MAP value. MRR reflects the position of the first correct answer associated with each query. The higher the first correct answer is ranked, the larger the corresponding MRR value. Higher values of MAP and MRR indicate better system performance. We use the official trec_eval (http://trec.nist.gov/trec_eval/) scripts to calculate these metrics. In their definitions, $N_q$ represents the number of all queries and $n_{a_i}$ represents the number of all relevant correct answers for query $i$. $P_i(r)$ represents the average accuracy of the $i$-th query at recall ratio $r$. $\mathrm{rank}_k$ represents the position of the $k$-th correct candidate answer in the entire answer sequence after the candidate answers for the query are ranked by confidence. $\mathrm{rank}_i$ represents the position of the first correct candidate answer for query $i$ in the set of candidate answers.
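The two metrics described above can be computed from ranked relevance labels as in the following sketch. It is an illustration of the standard definitions, not a replacement for the official trec_eval scripts; the function names and the toy rankings are hypothetical, and it assumes that every correct answer of a query appears somewhere in its ranked candidate list.

```python
def average_precision(labels):
    """Average precision for one query; labels are 1/0 relevance flags in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at the rank of this correct answer
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(labels):
    """1 / rank of the first correct answer, or 0.0 if the query has none."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def mean_metrics(ranked_queries):
    """Return (MAP, MRR) averaged over all queries."""
    count = len(ranked_queries)
    mean_ap = sum(average_precision(q) for q in ranked_queries) / count
    mean_rr = sum(reciprocal_rank(q) for q in ranked_queries) / count
    return mean_ap, mean_rr


# Toy example with two queries and three ranked candidates each.
rankings = [
    [1, 0, 1],  # correct answers at ranks 1 and 3
    [0, 1, 0],  # single correct answer at rank 2
]
map_score, mrr_score = mean_metrics(rankings)
```

For the first toy query the average precision is (1/1 + 2/3) / 2 and the reciprocal rank is 1; for the second both are 1/2, so the sketch gives a MAP of about 0.667 and an MRR of 0.75.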

Experimental Setting.
In this paper, different experimental factors are set to test and evaluate our proposed method, and our method is then compared with other state-of-the-art methods on the same datasets. The neural network model is implemented with the TensorFlow library. In the course of training, we continuously observe the performance on the test set and select the parameters with the highest MAP and MRR scores for final evaluation. Our implementation is as follows: (1) Word Embedding. Pretrained GloVe (https://github.com/stanfordnlp/GloVe) [36], offered by the shared task with 400 dimensions, is used as the word embedding layer. In addition, each sentence is padded, with out-of-vocabulary (OOV) handling, to a fixed maximum length of 40 words for both questions and answers. In the candidate answer pool, we set the number of negative answers to $K = 5$.
(2) Parameter Initialization. During training, we set the mini-batch size to 40 and, following the Adam [13] settings in TensorFlow, initialize the learning rate to 0.001. The margin $M$ is fixed to 0.2 and the regularization parameter $\lambda$ is set to 1e-5. Furthermore, we experimented with single-layer BiLSTM, stacked BiLSTM, and stacked BiLSTM with coattention. Each LSTM layer has a memory size of 200.
(3) Optimization Algorithm. The Adam algorithm [37] is used with a decay rate of 0.95 to update the parameters and optimize our model. We also add a dropout layer after the word embedding to avoid overfitting and set the dropout rate to 0.5. To keep the weights within a certain range and avoid gradient explosion, gradient clipping is used, with the gradient threshold set to 5.
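The gradient clipping step in the settings above can be sketched as follows. The text does not say whether clipping is applied by global norm or by value, so clipping by global norm is an assumption here; the function name and the toy gradients are hypothetical.

```python
import math


def clip_by_global_norm(grads, threshold=5.0):
    """Rescale a list of gradient vectors so that their joint L2 norm is at most threshold."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= threshold:
        # Already within the allowed range: return unchanged copies.
        return [list(vec) for vec in grads], global_norm
    scale = threshold / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm


# Toy example: two parameter groups whose joint norm is 13 (from 3, 4, and 12).
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, threshold=5.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Scaling all gradients by the same factor keeps the update direction unchanged and only limits its length, which is why this variant is often preferred over clipping each component separately.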

Results and Analysis.
To verify the validity and accuracy of the model that fuses the stacked BiLSTM network with the coattention mechanism for intelligent question answering, we evaluated it on the TREC-QA (8-13) dataset and the Wiki-QA dataset, and we analyzed and summarized the experimental results.

Results and Analysis of TREC-QA (8-13) Dataset.
We conducted a comparative experiment on single-layer BiLSTM, stacked BiLSTM, and stacked BiLSTM with coattention on the TREC-QA (8-13) dataset. Figure 6 compares the sentences of semantic analysis with or without coattention. Figure 7 reveals the variation in evaluation metrics with the epochs. Table 3 shows the details of experimental results for all mentioned baselines and our proposed model.
(1) In contrast to the traditional work of Yih et al. [18] and Yu et al. [38], which analyzed the problem from the perspective of sentence structure, both our experiments and many previous studies such as BiLSTM [1] and CNN [39] achieve better performance. These studies show that semantic analysis of sentences is necessary for NLP tasks and that deep neural networks make the sentence vectors more representative.
(2) We found that our experimental results with the coattention mechanism were significantly better than most of the above results [1,8,38]. Specifically, comparing the results of line 15 with Nie et al. [8], our model achieved a 3.52% gain in MAP and a 3.83% gain in MRR. These experimental results demonstrate that the coattention mechanism and the attention mechanism play an important role in improving NLP experimental results. Their proper use allows the model to attend to the output vectors and extract the critical information well when the input format is flexible. In this way, they can bridge the lexical gap between questions and answers while capturing QA pair correlations.
(3) The results of the stacked BiLSTM are better than those of the single-layer BiLSTM when line 11 and line 12 are compared with line 13 and line 14, respectively. Furthermore, Wang and Nyberg [6] used three-layer BiLSTM networks and achieved an increase in MAP (1.52%) and MRR (1.49%) over the single-layer BiLSTM of line 11. In general, an appropriate number of BiLSTM layers helps the model understand the relationships between words at a deeper level and better extract the characteristics of the sentence itself.
(4) The best MAP (0.7613) and MRR (0.8401) are obtained by incorporating the coattention mechanism into a stacked BiLSTM neural network and combining cosine similarity and Euclidean distance to calculate the matching degree between two vectors. Our experimental result outperforms the state-of-the-art baselines of Tan et al. [1,7] in MAP (0.83%) and MRR (0.79%), which shows that combining the cosine similarity and the Euclidean distance balances the relationship between the angle and the distance of two vectors and matches questions and answers more effectively.
Firstly, we conducted comparative experiments in the model training process: we randomly selected question and answer sentences from the test set of TREC-QA (8-13), trained the model with and without the coattention mechanism, and obtained the corresponding semantic vector representations from the different models. The results verified that the presence or absence of the coattention mechanism affects the semantic representation of a sentence. The comparison results are shown in Figure 6.
In Figure 6, the top row of the four matrices represents the semantic parsing results with the coattention mechanism, and the bottom row shows the results without it. It can be seen from the figure that, after adding the coattention mechanism, the more critical words of the four sentences receive more weight and are more prominent in the representation of the sentence, while the semantic weight of verbs such as "is" and articles such as "the" is correspondingly reduced. The analysis shows that the coattention mechanism is able to capture the relationships within a sentence and between sentences and can make the semantic representation of a sentence more complete without adding extra manual conditions.
Secondly, we verified the epoch sensitivity of the above models under different numbers of training epochs. Figure 7 shows the variation in MAP and MRR for each model. We performed a comparative experiment of five models, including BiLSTM, stacked BiLSTM, stacked BiLSTM with coattention, BiLSTM with coattention, and stacked BiLSTM with coattention; furthermore, we also presented changes in MAP and MRR for the same model at different epochs.
We performed an epoch-number sensitivity analysis on our proposed model, which varied from 5 to 35. Figure 7 displays the changes in the validation data for MAP and MRR when we change the number of epochs. We observed that both MAP and MRR changed with increasing the number of epochs but tended to be stable after epoch 25. However, the MAP and MRR values of some models have a decreasing trend as the epoch number increases more than 30. It reflects that a certain range of iterations is able to enhance the learning ability of the model and improve the experimental results.
Figure 6: Comparison of sentence semantic analysis with or without coattention.
We presented an optimized deep model by using stacked BiLSTM, coattention mechanism, attention mechanism, and a combined similarity metric, and our experimental results are shown in line 11 to line 15 of Table 3. We compared and summarized our observations as follows.

Results and Analysis of Wiki-QA Dataset.
We conducted further comparison experiments on the Wiki-QA dataset. Validating the model on the Wiki-QA dataset makes the proposed approach more convincing. The parameter initialization and presets of the model on the Wiki-QA dataset are basically consistent with the settings for the TREC dataset, except that the batch size is 30. Because this task is also one of information retrieval and candidate answer ranking, MAP and MRR are again selected as the evaluation metrics, in line with the official evaluation.
We also validated the various models under different numbers of epochs on the Wiki-QA dataset, as shown in Figure 8. It can be seen from the figure that the models tend to be stable once the number of epochs reaches 30. When the number of epochs continues to increase, both MAP and MRR show a slight downward trend. The experimental results show not only that the model architecture in this paper is effective for sentence semantic analysis but also that the model has good accuracy and robustness.
The experimental results of each model on the Wiki-QA dataset are shown in Table 4. Compared with current related research, the model results are superior to most baseline models [40,41]. Comparing the results of line 1 and line 5 of Table 4, it can be seen that the stacked BiLSTM model is much more accurate than the single-layer LSTM model. In addition, the best result of our model exceeds that of the model in [42] by 0.05% in average accuracy. In the field of intelligent question answering, these results confirm that the model performs well in capturing and representing the semantics of question and answer sentences.

Conclusion
In this paper, we proposed a stacked BiLSTM neural network based on the coattention mechanism for question answering. The stacked BiLSTM is used for sentence semantic understanding and modeling; the coattention mechanism and the attention mechanism are used to obtain the codependent representations of questions and answers; the combination of cosine similarity and Euclidean distance is used to calculate the similarity between the question and the answer. As reported in Section 4.2, we conducted experiments on the TREC-QA (8-13) and Wiki-QA datasets, and the experiments on the TREC-QA (8-13) dataset demonstrated that our model achieves the best MAP (0.7613) and MRR (0.8401). We obtained an improvement in MAP (0.83%) and MRR (0.79%) compared with the best baselines. Experimental results show that the proposed model is efficient for question answering. Note that the model was tested on only two small datasets. Future work will focus on replacing the original coattention mechanism with the dynamic coattention network plus (DCN+) and on incorporating CNN into the model to improve the experimental results. In addition, applying the proposed model to other large-scale datasets such as SQuAD and SemEval-cQA will be an important topic for future work.

Data Availability
This work involved data from the Text REtrieval Conference (TREC) 8-13 datasets and the Wiki-QA datasets. We used the 53417 Q&A pairs in TREC 8-12 to train the model, while using 1148 Q&A pairs and 1517 Q&A pairs in TREC 13 for development and testing, respectively. All researchers can access the data at the following sites: http://nlp.stanford.edu/mengqiu/data/qa-emnlp07-data.tgz, https://www.microsoft.com/en-us/download/details.aspx?id=52419. The data are divided into train data and development/test data.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.