QA4PRF: A Question Answering based Framework for Pseudo Relevance Feedback

Pseudo relevance feedback (PRF) automatically performs query expansion based on top-retrieved documents to better represent the user's information need and thereby improve the search results. Previous PRF methods mainly select expansion terms that occur frequently in the top-retrieved documents or that are semantically similar to the original query. However, existing PRF methods hardly try to understand the content of documents, which is very important for performing effective query expansion that reveals the user's information need. In this paper, we propose a QA-based framework for PRF, called QA4PRF, that utilizes contextual information in documents. In this framework, we formulate PRF as a QA task in which the query and each top-retrieved document play the roles of the question and the context in the corresponding QA system, and the objective is to use contextual information to find proper terms for expanding the original query, which are analogous to the answers in a QA task. In addition, an attention-based pointer network is built to understand the content of top-retrieved documents and to select the terms that better represent the original query. We also show that incorporating traditional supervised learning methods, such as LambdaRank, to integrate PRF information further improves the performance of QA4PRF. Extensive experiments on three real-world datasets demonstrate that QA4PRF significantly outperforms the state-of-the-art methods.


I. INTRODUCTION
Query expansion plays a key role in information retrieval, as it tries to find proper terms to revise the original query so as to better represent the user's information need [1]. Many methods have been proposed to select expansion terms. Some of them need the relevance scores of documents with respect to the given query; these are relevance feedback methods, such as Rocchio's algorithm [2,3]. Another branch of methods, known as pseudo relevance feedback (PRF) [4], assumes that the top-retrieved documents are relevant to the original query, while the others are irrelevant. These "pseudo" relevant documents are then used to reformulate the original query by adding new terms. Compared to methods with relevance feedback, PRF methods are more practical in real-world applications, as ground-truth relevance scores are not always available.
There are various PRF methods, which can be categorized into relevance-based models [5,6,7], divergence-based models [8,9], information-based models [10,11,12], matrix factorization-based models [13], supervised learning-based models [14] and word embedding methods [15,16]. Nonetheless, we argue that existing PRF models are insufficient, since they only consider terms with high occurrence frequency in the top-retrieved documents or with high semantic similarity to the original query. None of them tries to understand the content of documents the way a human reader would, which is very important for performing effective query expansion that reveals the user's underlying information need. This kind of contextual interaction information should be taken into account to improve accuracy and interpretability when expanding the query. For example, suppose a user issues the query "How are Oscar winners selected?". After the first-round retrieval, the term "film" appears 53 times in the top 10 retrieved documents, far more often than any other word (except stopwords). As such, most existing PRF models will select "film" to expand the original query, although the term "film" has almost no effect on the retrieval performance. An analysis of the top-retrieved documents readily shows that "voter" is the best expansion term for this query: it increases the mean average precision (MAP) by about 10%. However, "voter" appears only 7 times in the top 10 retrieved documents. This example shows the importance of understanding the content of top-retrieved documents in the PRF task.
In the natural language processing field, machine reading comprehension (MRC) [17], a framework for the question answering (QA) task proposed in 2016, provides a promising way to address this problem. In a QA system, the MRC framework tries to comprehend the question and the corresponding passage or contexts, and outputs one or several spans of words in the passage as the answer to the question. Inspired by MRC, in this paper we formulate PRF as a QA task: the goal of PRF is to find the most effective terms (analogous to the "answer" in QA) in each top-retrieved document (analogous to the "passage" in QA) for expanding the original query (analogous to the "question" in QA). The analogy between PRF and QA is illustrated in Figure 1. With the MRC framework, it is promising to make the PRF model generalize to diverse queries and their retrieved documents.
Multi-head attention [18] and bi-directional attention [19] are widely used deep learning architectures in QA tasks [20], which can effectively capture the global contextual information in long word sequences. As the output of QA is a subset of its input, the pointer network [21] and its variants are widely used and have been shown to be highly effective in MRC frameworks [22,23]. Regarding PRF as a QA task, it is natural to introduce the attention-based pointer network from QA into PRF, aiming to find the most relevant terms from the top-retrieved documents for a specific query.
However, applying an attention-based pointer network alone in PRF would neglect some useful terms with high occurrence frequency in the top-retrieved documents. Ignoring such statistical PRF information may lead to query topic drift problems [24]. Therefore, we treat this circumstance, which totally ignores semantic information, as a special case for QA4PRF. To address this issue, we incorporate a supervised learning module to estimate the importance of each term from the aspect of statistics; its result is then combined with that of the attention-based pointer network. In our framework, we use LambdaRank [25] as the supervised learning method, since it is a representative and effective learning-to-rank method with a simple implementation. Other supervised learning methods can also be incorporated into our framework without significant modifications.
To demonstrate the superiority of our proposed framework, we conduct extensive experiments on three search datasets, of which two are public benchmarks and the other is proprietary. The results show that QA4PRF significantly outperforms the state-of-the-art methods in terms of mean average precision (MAP), normalized discounted cumulative gain (NDCG), and precision at top-retrieved documents. The ablation study further validates the effectiveness of each component of QA4PRF.
To sum up, the main contributions of this work are as follows:
• To the best of our knowledge, we are the first to formulate PRF as a QA task and to propose a novel QA4PRF framework for query expansion. QA4PRF manages to understand the content of top-retrieved documents to find better expansion terms than existing methods.
• In the QA4PRF architecture, an attention-based pointer network is leveraged to learn the embedding of each term, considering global contextual information. To further utilize statistical PRF information, we propose to leverage a supervised learning model such as LambdaRank to enhance the performance of the QA4PRF framework.
• Extensive experiments on three search datasets demonstrate that QA4PRF achieves significantly better performance than state-of-the-art methods.
The rest of this paper is organized as follows. First, we discuss the related works in Section II. Then, Section III elaborates the details of the proposed QA4PRF framework. Extensive experiments and results analysis are presented in Section IV. Finally, we conclude this paper in Section V.

II. RELATED WORKS
Pseudo relevance feedback (PRF) models are widely used in query expansion and have been shown to be effective [2,5,8,10,13,26,27]. PRF models can be divided into semantic-based, statistics-based and hybrid models according to their sources of input information. In this section, we review these methods separately.

A. SEMANTIC-BASED PRF MODELS
Considering the semantic information of query and documents, semantic-based PRF methods adopt word embedding models to generate latent representations of words, and therefore queries and documents. With such latent representation, terms which are most similar to the query are selected for expansion. Roy et al. [15] proposed to apply kNN based methods to retrieve the most similar terms with respect to a query. Kuzi et al. [16] utilized the cosine similarity of embeddings to expand the query with terms that are semantically relevant to the query as a whole or to its terms.
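To make the embedding-based selection described above concrete, the following is a minimal sketch of cosine-similarity expansion against the query as a whole. It is not the implementation of [15] or [16]: representing the query by the centroid of its term vectors, the function names `cosine` and `expand_by_embedding`, and the three-dimensional toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def expand_by_embedding(query_terms, embeddings, k):
    """Return the k non-query terms whose vectors are closest to the query centroid."""
    vectors = [embeddings[t] for t in query_terms if t in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    # Assumption: the query as a whole is represented by the mean of its term vectors.
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    candidates = [t for t in embeddings if t not in query_terms]
    candidates.sort(key=lambda t: cosine(centroid, embeddings[t]), reverse=True)
    return candidates[:k]

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "oscar": [0.9, 0.1, 0.0],
    "winner": [0.8, 0.3, 0.1],
    "award": [0.85, 0.2, 0.05],
    "banana": [0.0, 0.1, 0.9],
}
expanded = expand_by_embedding(["oscar", "winner"], toy_embeddings, k=1)
```

Because the similarity is computed from global vectors only, the ranking of candidates is the same regardless of which documents were retrieved, which is the limitation discussed in the next paragraph.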
These embedding methods provide global representations of terms but fail to comprehend the content of the top-retrieved documents, which makes it hard for the model to generalize to different queries and top-retrieved documents. In our work, we propose an attention-based pointer network that captures contextual interaction information to address this issue.

B. STATISTICS-BASED PRF MODELS
Statistics-based PRF models assume that the most frequent terms in top-retrieved documents are the best words to expand the query. Such statistical information includes term frequency, inverse document frequency, document length, etc.
Relevance-based models [5,6,7] assume that the terms in a query are generated by a relevance model $P(w|R)$ (where $R$ denotes the relevance class). RM3 [6] and RM4 [5] provide different approaches to estimating such a relevance model. Building on the RM3 model, RM3+ [7] takes the inverse document frequency of terms into consideration.
Information-based models [10,11,12] select the most informative terms to expand the original query. As stated in [10], the information of a term in a document can be defined as the statistical difference between the behavior of the term in that document and in the whole collection. Based on this, Montazeralghaem et al. [11] introduced extra term proximity constraints such that a term that appears near a query term has a higher weight. Recently, Montazeralghaem et al. [12] proposed further interdependence relationships to complete the existing constraints.
Divergence-based models expand the query with terms that make the expanded query similar to the relevant documents and dissimilar to the whole collection. DMM [8] implements this idea through KL-divergence. As follow-up work, MEDMM [9] improves DMM by introducing an entropy term as a regularizer to resolve the skewed feedback issue of DMM.
Matrix factorization-based methods [13] treat query expansion as a recommendation problem and establish a document-term weight matrix. Matrix factorization techniques are then used to reformulate the original query by filling the document-term weight matrix.
As can be observed, the aforementioned statistics-based models are all unsupervised learning methods. The work of Cao et al. [14] is the only statistics-based method that utilizes a supervised learning model, in which a discriminative model (a support vector machine in that work) is learned to judge whether a term should be chosen for query expansion.
Although these statistical methods can improve the performance of query expansion, all of them totally neglect the contextual interaction information in top-retrieved documents. In our proposed QA4PRF, we formulate PRF as a QA task and apply machine reading comprehension, as a framework for QA tasks, to solve the PRF problem. For the special case which completely ignores any semantic information, we incorporate LambdaRank [25] to integrate statistical PRF information to improve the performance of our framework.

C. HYBRID PRF MODELS
Although semantic-based methods can improve retrieval performance after expanding the query, several works [15,16] have pointed out that models utilizing semantic information alone cannot achieve performance comparable to that of statistics-based approaches. Motivated by this observation, Kuzi et al. [16] proposed a hybrid PRF method that makes use of both statistical and semantic information. Their experimental results show that the semantic information learned by a word embedding model improves the performance of RM3 [6] in some cases.

D. SUMMARY
In this paper, we propose a QA-based framework for PRF, named QA4PRF, in which PRF is viewed as a QA task. The main differences between our framework and previous works are:
• Borrowing the idea from QA, an attention-based pointer network is used to learn the embedding of each term, capturing contextual interaction information in long word sequences.
• To deal with the special circumstance of utilizing statistical PRF information, LambdaRank, a pair-wise learning-to-rank model with ranked list information, is incorporated into our work.

A. OVERVIEW
Pseudo relevance feedback (PRF) methods are widely adopted in query expansion, as they need no ground-truth relevance scores, which are usually unavailable in industrial IR scenarios. In PRF methods, the top-retrieved documents for a given query are assumed to be relevant to some extent. PRF methods select terms from such "pseudo" relevant documents, referred to as the candidate word set, to expand the original query so as to improve the retrieval performance.
In this work, we recast PRF as a question answering (QA) task to find relevant terms ("answers" in QA) in each top-retrieved document ("passage" in QA) for a specific query ("question" in QA). This section elaborates the details of our proposed QA4PRF framework. The notations used are summarized in Table 1 for ease of presentation.

Table 1. Summary of notations.

| Notation | Description |
|---|---|
| $Q$ | The query |
| $D_i$ | The i-th top-retrieved document |
| $w$ | A term in the candidate word set in general |
| $q_i$, $\mathbf{q}_i$ | The i-th term and its word embedding in query |
| $d_{j,i}$, $\mathbf{d}_{j,i}$ | The i-th term and its initial embedding in document $D_j$ |
| $e_i$, $\mathbf{e}_i$ | The i-th expansion term and its initial embedding |
| $M$ | Number of feedback documents |
| $N$ | Number of feedback terms |
| $t_w^Q$, $v_w^Q$ | Term frequency and its normalized form of $w$ in query $Q$ |
| $t_w^D$, $n_w^D$ | Term frequency and its normalized form of $w$ in document $D$ |
| $i_w$ | Inverse document frequency of term $w$ |
| $C$ | Number of documents in the collection |
| $C_w$ | Number of documents containing term $w$ |
| $avg_l$ | Average document length |
| $FV(w, Q)$ | Feature vector of term $w$ with respect to query $Q$ |
| $W_{QA}(w)$ | Weight of term $w$ from QA aspect |
| $W(w \mid Q)$ | Final expansion weight of term $w$ |
| $\Theta_z$, $b_z$ | Weight matrices and biases of hidden layer |
| $\beta$ | Feedback coefficient |
| $\gamma$ | Trade-off between pointer network (QA aspect) and statistical PRF |

The pseudo relevance feedback task considered in this paper is defined as follows. A query consists of n words, $Q = \{q_1, q_2, \cdots, q_n\}$, and the top-M retrieved documents are denoted as $D = \{D_1, D_2, \cdots, D_M\}$. The output of PRF is a list of terms $E = \{e_1, e_2, \cdots, e_N\}$ selected from those in the original document set $D$. These N terms are used to expand the original query. In the following, we use bold letters to denote the embedding vector of each term.
In the QA4PRF architecture, the attention-based pointer network learns the importance of the terms in the candidate word set with respect to the query, considering global contextual information. The importance of each term is decided by the semantic relationship between the query and this term, which takes the content of the documents into account at the same time. Then, to utilize statistical PRF information, we construct a feature vector for each word in the candidate word set and introduce LambdaRank [25] as a ranking model to predict the importance of each word. Finally, an interpolation method incorporates the result of LambdaRank into that of the attention-based pointer network, so as to enhance the performance of QA4PRF. The details of these components are presented in the following subsections.

B. ATTENTION-BASED POINTER NETWORK
Instead of generating word embeddings from only the local co-occurrence relationships within fixed-size context windows, as traditional word embedding models do, we adopt an attention layer to capture the global contextual information in long word sequences efficiently and effectively. Furthermore, a pointer network [21] is used to restrict the output of query expansion, which is a set of terms, to be a subset of its input, just as in MRC. The details of the attention-based pointer network are described as follows (shown in Figure 2).

1) Attention Layer
The attention layer is composed of a multi-head attention layer and an attention flow layer. First, the multi-head attention layer is utilized as the embedding block, as in most existing MRC models. The inputs of this layer are the initial embeddings of the terms in a top-retrieved document and in the original query. Following Vaswani et al. [18], the initial embedding of each term is set as the sum of its word embedding and its positional encoding. The word embedding is initialized from the 300-dimensional FastText vectors [28]. The positional encoding takes the same form as in the Transformer [18], so that the model can utilize the order of the sequence. This multi-head attention layer aims to learn the embeddings of the terms in a query or document while considering the relationship between the target word and the other words in that query or document.
The input embeddings form three matrices $Q$, $K$ (with $dim_K$ columns) and $V$, similar to the Transformer [18]. The output of one attention block can be written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{dim_K}}\right) V. \quad (1)$$

The output of multi-head attention is the concatenation of the results of the attention blocks computed in parallel. For ease of understanding, we illustrate our model with a single document $D_u$ and one attention block, where $D_u$ denotes the u-th top-retrieved document. In a document attention block, the matrices $Q$, $K$ and $V$ are defined as

$$Q = [\mathbf{d}_{u,1}, \ldots, \mathbf{d}_{u,m}]^{T} W_Q, \quad K = [\mathbf{d}_{u,1}, \ldots, \mathbf{d}_{u,m}]^{T} W_K, \quad V = [\mathbf{d}_{u,1}, \ldots, \mathbf{d}_{u,m}]^{T} W_V,$$

where $m$ denotes the length of document $D_u$ for convenience, $\mathbf{d}_{u,1}, \mathbf{d}_{u,2}, \ldots, \mathbf{d}_{u,m}$ denote the initial embeddings of the terms in document $D_u$, and $W_Q$, $W_K$ and $W_V$ are the weight matrices. Similar to Equation 1, the embedding block of document $D_u$ is specifically formulated as

$$[\mathbf{t}_{u,1}, \ldots, \mathbf{t}_{u,m}]^{T} = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{dim_K}}\right) V,$$

where $\mathbf{t}_{u,i}$ is the embedding of $d_{u,i}$ after the attention block. Then, a two-layer feed-forward network is applied, which can be formulated as $\mathbf{o}_{u,i} = \mathrm{MLP}(\mathbf{t}_{u,i})$, where MLP denotes the feed-forward network and $\mathbf{o}_{u,i}$ is the embedding of $d_{u,i}$ after the multi-head attention layer. Here we only show the attention block of document $D_u$ due to space constraints; the block for the query is similar. We denote the embedded vectors of the query terms $Q = \{q_1, q_2, \ldots, q_n\}$ after the multi-head attention layer as $\{\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_n\}$.
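As an illustration of a single attention block over one document, the sketch below computes softmax(QK^T / sqrt(dim_K)) V in plain Python. It assumes that each row of the input matrix is one term embedding; the helper names (`matmul`, `transpose`, `softmax`, `attention_block`) and the identity projection matrices in the toy example are hypothetical choices for illustration, and the multi-head concatenation, residual connection, layer normalization and MLP are omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_block(x, w_q, w_k, w_v):
    """One attention block: softmax(Q K^T / sqrt(dim_K)) V.

    x holds one initial term embedding per row (m rows for a document of length m).
    Returns the attended embeddings and the m x m attention weights.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    dim_k = len(k[0])  # number of columns of K
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(dim_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy document with m = 3 terms and 2-dimensional embeddings (illustrative values).
doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection matrices
attended, attn_weights = attention_block(doc_embeddings, identity, identity, identity)
```

Each row of `attn_weights` is a probability distribution over all terms of the document, which is how every output embedding can depend on the whole sequence rather than on a fixed-size window.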
Having described multi-head attention, we now present the attention flow layer, which is a Query-Doc attention. In this layer, we compute attention in two directions following Seo et al. [19]. This module is commonly used in previous machine reading comprehension models such as [20,29]. Such an attention block enables each word in the query to attend over all words in each top-retrieved document. For convenience, we denote the inputs of this block as the document $D_u = [\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_m]$ and the query $Q = [\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_n]$, which are the outputs of the multi-head attention layer. First, we compute a similarity matrix $S \in \mathbb{R}^{m \times n}$ that represents the similarity between each pair of document term and query term, following Seo et al. [19]. Then we use the matrix $S$ to obtain the attention weights in both directions, namely Doc-to-Query attention and Query-to-Doc attention.
Doc-to-Query attention denotes which query terms are the most relevant to each word in a top-retrieved document. The attention weights are obtained by normalizing each row of the matrix $S$ with the softmax function, $\mathbf{a}_{i:} = \mathrm{softmax}(S_{i:})$. The output matrix $A \in \mathbb{R}^{m \times d}$ is computed as $A_{i:} = \sum_{j} a_{ij} \mathbf{r}_j$. Therefore, $A$ contains the attended query vectors for a top-retrieved document.
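Query-to-Doc attention indicates which term in the document is the most relevant to each word in the query. The attention weights are obtained as $\mathbf{b} = \mathrm{softmax}(\max_{col}(S)) \in \mathbb{R}^{m}$. Then the attended vector matrix of the terms in the document, $B \in \mathbb{R}^{m \times d}$, is obtained as in Seo et al. [19] by tiling the attended document vector across the $m$ rows:

$$B_{i:} = \sum_{j=1}^{m} b_j \mathbf{o}_j, \quad i = 1, \ldots, m.$$

Finally, the output of the Query-Doc attention is computed by average pooling as $E = \mathrm{Avg}(A, B, D_u^{T}) \in \mathbb{R}^{m \times d}$.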
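The two attention directions can be sketched as follows. This is a simplified illustration rather than the model itself: the similarity matrix S is assumed here to be a plain dot product (Seo et al. [19] use a learned similarity function), the Query-to-Doc vector is assumed to be tiled across all m rows, and the name `query_doc_attention` and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def query_doc_attention(doc, query):
    """Bi-directional attention between m document vectors and n query vectors.

    Returns E (m x d), the Doc-to-Query weights (m x n) and the
    Query-to-Doc weights (length m).
    """
    m, n, d = len(doc), len(query), len(doc[0])
    # Assumption: dot-product similarity stands in for the learned function of [19].
    s = [[dot(o, r) for r in query] for o in doc]
    # Doc-to-Query: softmax over each row of S, then a weighted sum of query vectors.
    a_weights = [softmax(row) for row in s]
    a = [[sum(a_weights[i][j] * query[j][c] for j in range(n)) for c in range(d)]
         for i in range(m)]
    # Query-to-Doc: softmax over the row-wise maxima of S (max across query terms).
    b_weights = softmax([max(row) for row in s])
    attended = [sum(b_weights[i] * doc[i][c] for i in range(m)) for c in range(d)]
    b = [list(attended) for _ in range(m)]  # tiled across the m rows (assumption)
    # Average pooling of A, B and the document embeddings themselves.
    e = [[(a[i][c] + b[i][c] + doc[i][c]) / 3.0 for c in range(d)] for i in range(m)]
    return e, a_weights, b_weights

# Toy inputs: m = 3 document terms, n = 2 query terms, d = 2 (illustrative values).
toy_doc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_query = [[1.0, 0.0], [0.8, 0.2]]
e_out, d2q_weights, q2d_weights = query_doc_attention(toy_doc, toy_query)
```

In this toy input the first document term has the largest maximum similarity to the query, so it receives the largest Query-to-Doc weight.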

2) Pointer Network
In the PRF task, the expansion terms are chosen from the corresponding document, which is the input of the PRF model. That is to say, the output of our model is a subset of its input. For this reason, we impose a constraint on the output of the attention layer with the pointer network, as is often done in QA techniques [30]. For each term in the u-th top-retrieved document $D_u$, the pointer network outputs an expansion probability $P_{pointer}(d_{u,i} \mid Q, D_u)$, which is computed with $\mathbf{q} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{q}_i$, the embedded vector of the query. Here, $P_{pointer}(d_{u,i} \mid Q, D_u)$ is the expansion probability (the output of the pointer network) of the i-th word in the u-th top-retrieved document $D_u$ for a specific query $Q$.
In the expansion process, the output probability of a candidate term is defined as the sum of the weights that the pointer network assigns to this word in the top-M retrieved documents:

$$W_{QA}(w) = \sum_{u=1}^{M} \sum_{i:\, d_{u,i} = w} P_{pointer}(d_{u,i} \mid Q, D_u).$$
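The aggregation over documents can be sketched in a few lines. The function name `aggregate_pointer_scores` and the probability values are hypothetical; the only thing taken from the text is that a term's QA-side weight is the sum of its pointer-network outputs over the top-M documents.

```python
from collections import defaultdict

def aggregate_pointer_scores(docs, pointer_probs):
    """Sum the pointer-network probability of each term over all feedback documents.

    docs[u] is the token list of the u-th document and pointer_probs[u][i] is the
    expansion probability of its i-th token.
    """
    scores = defaultdict(float)
    for terms, probs in zip(docs, pointer_probs):
        for term, p in zip(terms, probs):
            scores[term] += p
    return dict(scores)

# Two toy feedback documents with made-up pointer outputs.
toy_docs = [["oscar", "voter", "film"], ["voter", "academy"]]
toy_probs = [[0.2, 0.5, 0.3], [0.6, 0.4]]
w_qa = aggregate_pointer_scores(toy_docs, toy_probs)
```

A term that the pointer network favors in several documents, such as "voter" here, accumulates a weight above any single-document probability.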

3) Training
To train the attention-based pointer network, the label of an expansion word $w$ for query $Q$ needs to be defined. We define the term with the largest $\Delta^{Q,w}_{NDCG}$ for query $Q$ as the "positive" word and the others as "negative" words, where $\Delta^{Q,w}_{NDCG}$ represents the NDCG promotion after expanding query $Q$ with term $w$. The network is trained with the cross-entropy loss $L$ of the attention-based pointer network, computed between the predicted probabilities $P_{pointer}(d_{u,i} \mid Q, D_u)$ and the labels $y(d_{u,i}) \in \{0, 1\}$ of the terms $d_{u,i}$. To overcome the difficulty of training deep networks, we employ a residual connection and layer normalization at the end of each multi-head attention block, following Vaswani et al. [18].
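The labeling rule and a cross-entropy objective can be sketched as below. The exact form of the loss is not reproduced here; this sketch assumes a binary cross-entropy averaged over terms, which is only one possible reading, and the names `label_terms` and `binary_cross_entropy` as well as the NDCG gains are hypothetical.

```python
import math

def label_terms(ndcg_gain):
    """Mark the term with the largest NDCG promotion as positive (1), the rest as 0."""
    best = max(ndcg_gain, key=ndcg_gain.get)
    return {w: 1 if w == best else 0 for w in ndcg_gain}

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels.

    Assumption: a binary form is used for illustration; a categorical form over
    the terms of a document would be an equally plausible choice.
    """
    total = 0.0
    for w, y in labels.items():
        p = min(max(probs[w], eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# Made-up NDCG promotions for three candidate terms of one query.
toy_gain = {"voter": 0.10, "film": 0.00, "academy": 0.03}
toy_labels = label_terms(toy_gain)
loss_good = binary_cross_entropy({"voter": 0.9, "film": 0.1, "academy": 0.1}, toy_labels)
loss_bad = binary_cross_entropy({"voter": 0.1, "film": 0.9, "academy": 0.9}, toy_labels)
```

Predictions that put most probability on the positive term yield the smaller loss, which is the behavior the training objective rewards.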
As mentioned in Section I, applying MRC framework alone in PRF may neglect statistical PRF information, thereby leading to several problems, such as query topic drift [24]. To handle this issue, we integrate statistical PRF information in QA4PRF to enhance the performance of the framework.

1) Statistics Feature Vector
In order to leverage statistical PRF information, we construct feature vectors for terms from the aspect of statistics. In PRF, the top-retrieved documents are assumed to be relevant to the original query. The candidate word set includes the terms (except stopwords) from the top-M retrieved documents of the query. For each term $w$ in the candidate word set, a feature vector of $w$ with respect to query $Q$, i.e., $FV(w, Q)$, is constructed (Equation 7). The feature vector consists of three factors:
• $v_w^Q$ is the normalized term frequency of $w$ in query $Q$, normalized by the sum of the term frequencies over all the terms in $Q$:

$$v_w^Q = \frac{t_w^Q}{\sum_{w' \in Q} t_{w'}^Q}, \quad (8)$$

where $t_w^Q$ is the term frequency of $w$ in $Q$.
• $i_w$ is the inverse document frequency of $w$ in the whole document collection (Equation 9), computed from $C$, the number of documents in the whole collection, and $C_w$, the number of documents containing term $w$ in the whole collection.
• $n_w^{D_u}$ is the normalized term frequency of $w$ in a top-retrieved document $D_u$ [10]:

$$n_w^{D_u} = t_w^{D_u} \log\left(1 + \alpha \frac{avg_l}{|D_u|}\right), \quad (10)$$

where $t_w^{D_u}$ is the term frequency of $w$ in $D_u$, $|D_u|$ is the length of $D_u$, $avg_l$ is the average document length in the collection and $\alpha$ is a hyper-parameter.
The factors $v_w^Q$ and $n_w^{D_u}$ reflect the statistical information of term $w$ in query $Q$ and document $D_u$, while the factor $i_w$ reports the statistical information of $w$ in the whole collection. Equation 8 and Equation 10 consider the length of the query and of the document in different ways. The reason is that queries have almost the same length, which is much smaller than that of documents. This means that a small change in length affects the terms in a document much less than those in a query. Therefore, we use the average document length and the log function in Equation 10 to estimate the normalized form.
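The three statistical factors can be sketched as follows. Two formulas in this sketch are assumptions rather than quotations: the inverse document frequency is taken as log(C / C_w), and the document-side normalization is taken as tf * log(1 + alpha * avg_l / document length), the log-based length normalization commonly used in information-based models. The function names and toy inputs are hypothetical.

```python
import math
from collections import Counter

def query_tf_norm(term, query_terms):
    """Term frequency of `term` in the query divided by the total term count of the query."""
    counts = Counter(query_terms)
    return counts[term] / sum(counts.values())

def inverse_doc_freq(num_docs, doc_freq):
    """Assumed IDF form log(C / C_w); other smoothed variants are equally possible."""
    return math.log(num_docs / doc_freq)

def doc_tf_norm(term, doc_terms, avg_len, alpha=1.0):
    """Assumed length-normalized term frequency: tf * log(1 + alpha * avg_len / len(doc))."""
    tf = doc_terms.count(term)
    return tf * math.log(1.0 + alpha * avg_len / len(doc_terms))

# Toy query and document (illustrative values only).
toy_query = ["oscar", "winner", "oscar"]
toy_doc = ["voter", "film", "voter", "academy"]
v_oscar = query_tf_norm("oscar", toy_query)
i_voter = inverse_doc_freq(num_docs=100, doc_freq=10)
n_voter = doc_tf_norm("voter", toy_doc, avg_len=4.0, alpha=1.0)
```

With this normalization a document of exactly average length scales the raw term frequency by log(1 + alpha), while longer documents are scaled down and shorter ones up.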
After generating feature vectors from aforementioned statistical PRF information for each term, we perform supervised learning methods to learn and predict the importance of each word, with respect to the query. In our framework, we apply LambdaRank [25] because it is an effective learning to rank method with easy implementation. As stated earlier, other supervised learning methods can also be adopted in our framework without significant modifications.

2) LambdaRank
In the PRF task, the objective is to improve the retrieval performance after expanding terms. For a specific query $Q$, we aim to improve NDCG by expanding word $w$ with the help of LambdaRank. The lift of NDCG after expanding term $w$ is denoted as $\Delta^{Q,w}_{NDCG}$. We apply a two-layer neural network to predict the probability that each term in the candidate word set is used for expansion:

$$P_{lambda}(w \mid Q) = \mathrm{sigmoid}\big(\Theta_2 \cdot \mathrm{relu}(\Theta_1 \cdot FV(w, Q) + b_1) + b_2\big). \quad (11)$$

The training process of LambdaRank in PRF is as follows. First, the candidate word set is split into a relevant word set and an irrelevant word set. Similar to the intuition of PRF, the relevant word set consists of the N words that bring the largest NDCG promotion after expansion, and the irrelevant word set includes the rest. (N is the number of query expansion terms, a hyper-parameter which will be discussed in Section IV. Other evaluation metrics, such as MAP and ERR, are also feasible.) Then, a pair of words $(w_i, w_j)$ is selected such that $w_i$ is drawn randomly from the relevant word set and $w_j$ is drawn randomly from the irrelevant word set. That is to say, $\Delta^{Q,w_i}_{NDCG} > \Delta^{Q,w_j}_{NDCG}$. We use $|\Delta_{NDCG}|$ to denote the difference of NDCG promotion when making different choices between $w_i$ and $w_j$. Similar to LambdaRank, we take this difference into consideration in the loss function (Equation 12). The importance of each term in the candidate word set from the aspect of statistics is defined as the ranking score that LambdaRank assigns to this term (Equation 13).
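The scoring network of Equation 11 and the pair selection step can be sketched as follows. The weights, the feature vector and the NDCG promotions are made-up values; taking the pair weight as the absolute difference of the two promotions is an assumption of this sketch, and the loss itself is not reproduced. The names `p_lambda` and `sample_pair` are hypothetical.

```python
import math
import random

def affine(weights, x, bias):
    """Compute weights @ x + bias for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def p_lambda(fv, theta1, b1, theta2, b2):
    """Two-layer scorer: sigmoid(theta2 . relu(theta1 . fv + b1) + b2)."""
    hidden = [max(0.0, h) for h in affine(theta1, fv, b1)]
    logit = sum(t * h for t, h in zip(theta2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

def sample_pair(ndcg_gain, n_relevant, rng):
    """Draw w_i from the top-N terms by NDCG promotion and w_j from the rest."""
    ranked = sorted(ndcg_gain, key=ndcg_gain.get, reverse=True)
    relevant, irrelevant = ranked[:n_relevant], ranked[n_relevant:]
    w_i, w_j = rng.choice(relevant), rng.choice(irrelevant)
    # Assumption: the pair weight is the absolute difference of the two promotions.
    return w_i, w_j, abs(ndcg_gain[w_i] - ndcg_gain[w_j])

# Hypothetical feature vector and network weights (2 hidden units).
toy_fv = [0.5, 2.3, 1.4]
theta1 = [[0.1, 0.2, 0.3], [-0.5, 0.0, 0.1]]
bias1 = [0.0, 0.0]
theta2 = [1.0, -1.0]
score = p_lambda(toy_fv, theta1, bias1, theta2, 0.0)

toy_gain = {"voter": 0.10, "academy": 0.03, "film": 0.00, "movie": -0.02}
w_i, w_j, delta = sample_pair(toy_gain, n_relevant=2, rng=random.Random(0))
```

In a pairwise learning-to-rank setup, pairs with a larger promotion difference would typically contribute more to the gradient, which is why the difference is returned together with the pair.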

D. FINAL WORDS SELECTION
We incorporate the result of LambdaRank into the attention-based pointer network with linear interpolation, as shown in Figure 3. The QA4PRF framework estimates the weight $W(w_i \mid Q)$ of term $w_i$ for query $Q$ by interpolating the two results (Equation 14), where $\gamma \in [0, 1]$ is a hyper-parameter that trades off the additional statistical part against the attention-based pointer network. According to the weight $W(w_i \mid Q)$ of each term $w_i$, we sort the terms in the candidate word set in descending order and select the top N terms, with which the query is expanded. To this end, we define $P(w \mid Q)$ as the maximum likelihood estimate (MLE) of term $w$ with respect to query $Q$, such that $P(w \mid Q) = \frac{\text{term frequency of } w}{|Q|}$. The query $Q$ is updated to $Q'$ by adding the terms $w$ selected according to the weight $W(w \mid Q)$, combining $P(w \mid Q)$ with $P(w \mid E, Q)$ (Equation 15), where $\beta \in [0, 1]$ is the feedback coefficient, a hyper-parameter that makes a trade-off between the original query and the expansion terms. $P(w \mid E, Q)$ is the expansion score of term $w$ for a specific query $Q$. For convenience, we set $P(w \mid E, Q) = 1$ for the N expansion terms and $P(w \mid E, Q) = 0$ for other words.
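The final selection step can be sketched end to end. Two choices in this sketch are assumptions: gamma is placed on the statistical score and (1 - gamma) on the pointer-network score, and the updated query model is taken as (1 - beta) * P(w|Q) + beta * P(w|E,Q). The function names and all numeric values are hypothetical.

```python
def interpolate(w_qa, w_stat, gamma):
    """Linear interpolation of the QA-side and statistics-side term weights.

    Assumption: gamma weights the statistical score, (1 - gamma) the pointer network.
    """
    terms = set(w_qa) | set(w_stat)
    return {w: gamma * w_stat.get(w, 0.0) + (1.0 - gamma) * w_qa.get(w, 0.0)
            for w in terms}

def select_top_n(weights, n):
    """Sort terms by descending weight (ties broken alphabetically) and keep n."""
    return sorted(weights, key=lambda w: (-weights[w], w))[:n]

def update_query(query_terms, expansion_terms, beta):
    """Assumed update: (1 - beta) * P(w|Q) + beta * P(w|E,Q).

    P(w|Q) is the maximum likelihood estimate, and P(w|E,Q) is 1 for expansion
    terms and 0 otherwise.
    """
    model = {}
    for w in set(query_terms) | set(expansion_terms):
        p_q = query_terms.count(w) / len(query_terms)
        p_e = 1.0 if w in expansion_terms else 0.0
        model[w] = (1.0 - beta) * p_q + beta * p_e
    return model

# Made-up scores from the two components for three candidate terms.
toy_w_qa = {"voter": 1.1, "film": 0.3, "academy": 0.4}
toy_w_stat = {"voter": 0.6, "film": 0.9, "academy": 0.2}
final_weights = interpolate(toy_w_qa, toy_w_stat, gamma=0.5)
chosen = select_top_n(final_weights, n=1)
new_query = update_query(["oscar", "winner"], chosen, beta=0.1)
```

In this toy case "film" has the highest statistical score, but the interpolated weight still ranks "voter" first because of its larger QA-side weight.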

IV. EXPERIMENTS
In this section, we perform extensive experiments on three real-world datasets to evaluate our proposed framework. We aim to answer the following research questions (RQs):

1) Datasets
We conduct experiments on three real-world datasets, of which two are public benchmarks and the other is private.
TREC is an English benchmark dataset. We use the data from the TREC robust track 2004 collection. The document collection is from TREC Disks 4 and 5.
OGeek data comes from a sub-scenario of OPPO mobile search ranking optimization. This is a Chinese dataset with shorter documents (compared to TREC). Queries are entered by users when searching on mobile devices. Documents only contain the title of each page.
Private is collected from the user search logs of a mainstream App Store. Queries are entered by users when searching for apps on mobile devices. Each document is an app in the App Store.
To summarize, TREC is a dataset with full documents, while the other two are collections with shorter documents. Detailed statistics of these datasets are shown in Table 2. Following previous works [11,13], we only use the title field in TREC to represent each query. All documents are tokenized and stemmed, using the stemmer of the NLTK toolkit [31] for TREC and Jieba for OGeek and Private. After that, punctuation and stopwords are removed from each document. In all experiments, FastText [28] is used to generate the initial word embeddings. The pre-trained embedding models can be downloaded from the web (footnote 11). As the retrieval function, we utilize BM25 (footnote 12) [32,33], which is a simple yet effective ranking method in information retrieval. The hyper-parameters of BM25 are decided by cross validation on TREC. As mentioned above, documents in OGeek and Private are much shorter, so we splice the top-M documents for more accurate results.

2) Baselines
As stated in [15,34], word embedding (semantic-based) methods are less effective than statistical methods for pseudo relevance feedback. Therefore, we omit semantic-based methods from the overall performance comparison, but include them in the ablation study to compare their performance with that of the attention-based pointer network of QA4PRF. To compare overall performance, we mainly include statistical and hybrid approaches, 11 baselines in total. We categorize these baselines into different classes without going into the details of how they work (a detailed discussion is presented in Section II).

3) Parameter Setting
The

4) Evaluation Metrics
To evaluate the performance, we use mean average precision (MAP), normalized discounted cumulative gain (NDCG) and precision of top-retrieved documents. For TREC, following previous works [12,13], we select the top 1000 documents to evaluate MAP and NDCG, and the top 20 documents to evaluate precision. However, for the two datasets with shorter documents, OGeek and Private, we report MAP, NDCG and precision of the top 5 documents.
Footnote 11: https://fasttext.cc/docs/en/crawl-vectors.html
Footnote 12: We utilize BM25 as the retrieval function because our paper focuses on the query expansion model in PRF tasks rather than on the retrieval model. BM25 is a simple yet effective method commonly used in previous PRF works [7,13].
VOLUME 4, 2016
To illustrate the robustness of the models, we utilize the robustness index (RI) [35], which is defined as $(n_+ - n_-)/|Q|$, where $n_+$ and $n_-$ are the numbers of queries that have better and worse NDCG performance after query expansion, respectively, and $|Q|$ denotes the total number of test queries. A higher RI means that a model is more robust.
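The evaluation measures named above can be sketched as follows. The robustness index follows the definition in the text; the other functions use common textbook definitions that are assumptions here, in particular the linear gain with a log2 discount in NDCG and an ideal ranking computed from the judged list itself. Function names and toy values are hypothetical.

```python
import math

def average_precision(ranked_rels, num_relevant):
    """Average precision of one ranked list of 0/1 relevance labels."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

def dcg(rels):
    """Discounted cumulative gain with linear gain and log2 discount (assumed variant)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

def ndcg_at_k(ranked_rels, k):
    """NDCG@k, with the ideal ranking built from the same list (assumption)."""
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0

def precision_at_k(ranked_rels, k):
    return sum(1 for rel in ranked_rels[:k] if rel > 0) / k

def robustness_index(ndcg_before, ndcg_after):
    """(n_plus - n_minus) / |Q| over per-query NDCG before and after expansion."""
    improved = sum(1 for b, a in zip(ndcg_before, ndcg_after) if a > b)
    hurt = sum(1 for b, a in zip(ndcg_before, ndcg_after) if a < b)
    return (improved - hurt) / len(ndcg_before)

# Toy ranked list with two relevant documents at ranks 1 and 3.
toy_rels = [1, 0, 1, 0]
ap = average_precision(toy_rels, num_relevant=2)
ndcg = ndcg_at_k(toy_rels, k=4)
p_at_2 = precision_at_k(toy_rels, k=2)

# Toy per-query NDCG values before and after expansion for four queries.
ri = robustness_index([0.40, 0.50, 0.60, 0.30], [0.50, 0.45, 0.60, 0.40])
```

MAP is the mean of `average_precision` over all test queries; in the toy RI example two queries improve, one degrades and one is unchanged.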
Furthermore, the Wilcoxon signed-rank test [36] has been conducted to demonstrate that the differences between our proposed framework and the strongest baselines are significant.

B. OVERALL PERFORMANCE (RQ1)
In this subsection, we compare the performance of our proposed QA4PRF with several PRF baselines. Table 3 reports the overall performance of all models on three datasets, where underlined numbers are the best results of baselines and bold numbers indicate the best results of all models.
From Table 3, we can conclude that our proposed framework achieves the best performance on the three real-world datasets. Specifically, compared with the best baseline on these three datasets, QA4PRF improves NDCG by 0.42%, 1.09% and 1.66% (MAP by 1.15%, 1.16% and 2.40%, and precision by 4.12%, 0.89% and 1.58%), respectively. This demonstrates the superiority of our framework over the baselines on both English and Chinese datasets, with various lengths and numbers of documents. The Wilcoxon signed-rank test shows that our method achieves significant improvements on these metrics. Besides, the RI results demonstrate that our proposed QA4PRF is more robust than the baselines in most circumstances. Further, all the PRF models outperform NoPRF on the three datasets according to Table 3, which indicates the effectiveness of using the information in "pseudo" relevant documents for query expansion. Information-based models (such as LL, LL(pro), LL(ALL)) perform better than other baselines, such as relevance-based (e.g., RM3, RM4 and RM3+) and divergence-based (e.g., DMM and MEDMM) models, under most circumstances. These findings are consistent with the results and claims in previous studies [11,12,37].
Moreover, to further prove the effectiveness of our proposed QA4PRF, we divided queries of TREC into 5 categories (biology and medicine, legal theory, news, international relations, science and technology) according to the user's intent. The specific query classification method is provided in the code. Table 4 shows the result of each category. It is obvious that, for most cases, QA4PRF has a 0.7% to 2.4% improvement in terms of MAP compared with the best baseline, except for category legal theory. Even in the query of legal theory category, our proposed model can get almost the same performance as the best baseline. Such results illustrate the superiority of our proposed QA4PRF over baselines for diversified queries.  In addition, to study how QA4PRF performs, we present the training loss, testing loss and testing performance (MAP@1000) in each iteration on TREC in Figure 4. Specifically, Figure 4(a) shows the training process of attentionbased pointer network for the first test set in cross validation, while that of LambdaRank is displayed in Figure 4(b). For both methods, we report the best parameter settings. It is obvious that both methods achieve stable performance after about 7 iterations. Extensive studies of these two methods are in Section IV-D.

C. HYPER-PARAMETER STUDY (RQ2)
Our proposed QA4PRF has several key hyper-parameters that may affect the performance of the framework, i.e., (i) the number of feedback documents M, (ii) the number of feedback terms N, (iii) the feedback coefficient β and (iv) the trade-off γ between the attention-based pointer network (QA aspect) and LambdaRank (statistical PRF aspect). In this subsection, to study the impact of these hyper-parameters on our proposed framework, we tune one of them while fixing the others.
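The one-at-a-time protocol described above can be sketched as follows. This is only an illustration: `toy_map` is a made-up stand-in for running the full retrieval pipeline and measuring MAP, the value grids are hypothetical, and the default values mirror the settings the paper reports fixing (M = 10, N = 60, β = 0.1, γ = 0.5).

```python
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]


def one_at_a_time_sweep(
    evaluate: Callable[[Params], float],
    defaults: Params,
    grids: Dict[str, Sequence[float]],
) -> Dict[str, List[Tuple[float, float]]]:
    """Vary each hyper-parameter over its grid while the others stay at their defaults."""
    curves: Dict[str, List[Tuple[float, float]]] = {}
    for name, values in grids.items():
        curve = []
        for value in values:
            params = dict(defaults)  # copy, so every run starts from the defaults
            params[name] = value
            curve.append((value, evaluate(params)))
        curves[name] = curve
    return curves


def toy_map(p: Params) -> float:
    """Hypothetical objective that peaks at the defaults; NOT a real retrieval run."""
    return (0.30
            - 0.001 * abs(p["M"] - 10)
            - 0.0005 * abs(p["N"] - 60)
            - 0.05 * abs(p["beta"] - 0.1)
            - 0.02 * abs(p["gamma"] - 0.5))


defaults = {"M": 10, "N": 60, "beta": 0.1, "gamma": 0.5}
grids = {
    "M": [5, 10, 20, 50, 100],
    "N": [20, 40, 60, 80, 100],
    "beta": [0.0, 0.1, 0.3, 0.5, 1.0],
    "gamma": [0.0, 0.25, 0.5, 0.75, 1.0],
}
curves = one_at_a_time_sweep(toy_map, defaults, grids)
best = {name: max(curve, key=lambda vs: vs[1])[0] for name, curve in curves.items()}
print(best)
```

Each entry of `curves` corresponds to one panel of a hyper-parameter plot: the varied value on the x-axis and the resulting score on the y-axis.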
Specifically, we set the number of feedback documents M (in Section III-C) to 10 and the number of feedback terms N (in Section III-D) to 60, which are common settings in existing PRF methods. For the feedback coefficient β (in Equation 15) and the trade-off γ (in Equation 14), we fix them to 0.1 and 0.5, respectively, by cross validation. Figure 5 presents the experimental results of the hyper-parameter study in terms of MAP on the TREC dataset. For each hyper-parameter, we have the following observations.
• Number of feedback documents M: The proposed framework performs better when M is enlarged from 5 to 10, and the best performance is achieved when M = 10. As M increases further from 10 to 100, the performance of the model keeps dropping. This rising-falling trend is reasonable. When more feedback documents are considered, more terms are included in the candidate word set, so the chance of expanding the query with useful terms is larger, which leads to performance improvement. However, involving too many documents introduces noisy words into the candidate word set, which results in expanding the query with useless terms and therefore worse performance.
• Number of feedback terms N: As shown in Figure 5(b), the MAP trend with the number of feedback terms is rising-falling and reaches its peak when N = 60. Expanding more terms increases the chance of formulating useful queries that fit the user intent, which helps to boost the performance. However, expanding too many terms adds noise to the query and causes a mismatch with the user intent, so that the performance degrades.
• Feedback coefficient β: β controls the trade-off between the original query and the expansion terms. As can be observed in Figure 5(c), the framework obtains the best performance when β = 0.1 and the worst performance when β = 0. Note that β = 0 is actually the ranking without query expansion, namely, NoPRF. This validates the effectiveness of query expansion with PRF, which is consistent with the observation in Section IV-B. When β > 0.1, the MAP value drops slowly, which indicates that the PRF model can help improve the retrieval effectiveness of the original query, but cannot completely replace the original query (i.e., β = 1).
• Trade-off γ: γ controls the balance between the attention-based pointer network (QA module) and LambdaRank (statistical PRF). The best performance is achieved when γ = 0.5, which validates that it is effective to consider semantic QA and statistical PRF information simultaneously.
Note on the result tables: "Rel. Impr." presents the relative improvement of our proposed method over the best baseline; "P@k" represents the precision of the top k documents.
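To make the roles of γ, N and β concrete, the sketch below combines two per-term score tables, keeps the N best terms, and interpolates them with the original query. It is a hedged illustration only: Equations 14 and 15 are not reproduced here, so the convex-combination form, the direction of γ (weight on the QA score), the normalization step, the function names and the toy scores are all our assumptions. The only properties taken from the text are that β = 0 reduces to the original query and β = 1 replaces it entirely.

```python
from typing import Dict

Weights = Dict[str, float]


def combine_term_scores(qa: Weights, stat: Weights, gamma: float) -> Weights:
    """Assumed form: gamma * QA score + (1 - gamma) * statistical score."""
    terms = set(qa) | set(stat)
    return {t: gamma * qa.get(t, 0.0) + (1.0 - gamma) * stat.get(t, 0.0) for t in terms}


def top_n_terms(scores: Weights, n: int) -> Weights:
    """Keep the n highest-scoring terms and normalize their weights to sum to 1."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    total = sum(score for _, score in ranked)
    return {t: score / total for t, score in ranked} if total > 0 else {}


def interpolate_query(original: Weights, expansion: Weights, beta: float) -> Weights:
    """Assumed form: (1 - beta) * original model + beta * expansion model."""
    terms = set(original) | set(expansion)
    return {t: (1.0 - beta) * original.get(t, 0.0) + beta * expansion.get(t, 0.0)
            for t in terms}


# Toy inputs (hypothetical scores, not outputs of the actual models).
original = {"modern": 0.5, "slavery": 0.5}
qa_scores = {"worker": 0.6, "labor": 0.3, "life": 0.1}
stat_scores = {"worker": 0.4, "life": 0.3, "labor": 0.3}

combined = combine_term_scores(qa_scores, stat_scores, gamma=0.5)
expansion = top_n_terms(combined, n=2)
expanded_query = interpolate_query(original, expansion, beta=0.1)
print(sorted(expanded_query.items(), key=lambda kv: -kv[1]))
```

With these toy scores, "life" is dropped because only the two highest-scoring candidates are kept, and the original query terms retain most of the weight since β is small.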

D. ABLATION STUDY (RQ3)
In QA4PRF, two components may affect the performance of the framework: the attention-based pointer network (in Section III-B), which formulates PRF as a QA task, and LambdaRank (in Section III-C), which leverages statistical PRF information to enhance the performance. In this subsection, to study the effectiveness of each component, we evaluate these two components separately against state-of-the-art baselines on the TREC dataset.
To demonstrate the superiority of the attention-based pointer network, several state-of-the-art semantic-based approaches are chosen as baselines, such as Cent, CombSUM, CombMNZ and CombMAX in [16] and kNN-embed in [15]. Besides, we also include a QA baseline, QANet [20], which performs better than other MRC frameworks. To validate the effectiveness of LambdaRank, RM3+, MEDMM, LL(ALL) and SVM are selected as baselines from the ones described in Section IV-A, as they are the best-performing ones. The performance comparisons are presented in Table 5 and Table 6, respectively.
The attention-based pointer network (referred to as "Atten-pointer" in Table 5) is designed to capture contextual interaction information among long word sequences. The results in Table 5 show that such a network achieves much better performance in query expansion than the semantic-based baselines. Such results show the potential of other QA techniques for the PRF task. Compared to QANet, a strong baseline in QA, our attention-based pointer network also achieves better performance. This indicates that our model is more suitable for PRF tasks than general QA models.
LambdaRank is used to learn the importance of each term based on feature vectors from a statistical perspective. The results in Table 6 show that LambdaRank outperforms all the baselines in terms of NDCG, MAP and precision. Such results demonstrate the effectiveness of our statistical PRF component compared to the state-of-the-art statistical approaches. Furthermore, since both are supervised learning models, the comparison with SVM shows the superiority of applying LambdaRank, which also indicates the potential of trying other learning-to-rank models in the statistical PRF part. Compared with semantic PRF approaches, the statistical methods are slightly superior in terms of NDCG and MAP. This observation is consistent with the findings in [34].
From the above two observations, it can be concluded that solving the PRF problem with QA techniques can bring considerable improvement. At the same time, incorporating statistical PRF information can, to a certain extent, further enhance the performance of our framework.

E. CASE STUDY (RQ4)
To take a closer look at the characteristics of the terms selected by QA4PRF, we randomly pick two queries from the TREC dataset and present the results of LL(ALL) and QA4PRF in Table 7. All terms in this section are shown in their stemmed forms, as discussed in Section IV-A. Table 7 shows the top 20 expansion terms selected by LL(ALL) and QA4PRF, respectively. Bold terms indicate the terms that differ between the two methods. As can be observed, both LL(ALL) and QA4PRF can provide several terms with high term frequency which are useful for retrieval. However, LL(ALL) may also include noisy terms that have high term frequency in the top-retrieved documents. For example, "life" and "peopl" for the query "Modern Slavery", and "spokesman" and "countri" for the query "Diplomatic Expulsion", are not informative. QA4PRF, on the contrary, is able to find useful and informative words for the given queries, e.g., "worker", "govern" and "labor" represent the main objects and reasons for "Modern Slavery", while "russian", "iraq" and "attach" are answers to "where and why do frequent diplomatic expulsions occur?".
To study the terms that differ between the two methods quantitatively, we take the query "Modern Slavery" as an example and show the term frequency and MAP improvement of the top 20 terms selected by LL(ALL) and QA4PRF, respectively, in Figure 6. Each square in Figure 6 represents a term in Table 7. The expansion terms are arranged in descending order from left to right and top to bottom according to the score given by each model, corresponding to Table 7. Red squares highlight the terms that differ between the two methods. The term frequency of a word is the total number of its occurrences in the top 10 documents. LL(ALL) clearly selects expansion terms with higher term frequency than QA4PRF.
According to the comparison of the MAP improvement of the two models, we can see that, even though QA4PRF selects 5 different terms with lower term frequency, these 5 terms lead to a much larger MAP improvement than the ones selected by LL(ALL). QA4PRF achieves this by understanding the content of the top-retrieved documents when selecting expansion terms.
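The term-frequency statistic used in this case study, i.e., the total number of occurrences of a term in the top-retrieved documents, can be computed as in the following sketch. The tokenized documents are toy examples, and the function name is an illustrative assumption; tokenization and stemming are assumed to have been applied beforehand.

```python
from collections import Counter
from typing import Iterable, List


def feedback_term_frequency(top_docs: Iterable[List[str]]) -> Counter:
    """Total occurrences of each term across the given feedback documents."""
    counts: Counter = Counter()
    for tokens in top_docs:
        counts.update(tokens)  # adds one count per token occurrence
    return counts


# Toy, already-stemmed feedback documents (hypothetical content).
top_docs = [
    ["worker", "labor", "peopl", "life", "worker"],
    ["peopl", "govern", "life", "peopl"],
    ["labor", "worker", "peopl"],
]
tf = feedback_term_frequency(top_docs)
print(tf.most_common(3))
```

A purely frequency-driven selector would rank "peopl" first in this toy example, which mirrors the kind of frequent but uninformative term discussed above.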

V. CONCLUSION
In this work, we formulate pseudo relevance feedback (PRF) as a question answering (QA) task and propose a novel QA-based framework for PRF called QA4PRF, where the objective is to find proper terms to expand the original query by utilizing contextual information in documents. In the QA4PRF framework, an attention-based pointer network is utilized to understand the top-retrieved documents in a human-interpretable way. Such a network is efficient and effective in capturing contextual interaction information among long word sequences in machine reading comprehension. Besides, we find that incorporating traditional supervised learning methods, such as LambdaRank, to make use of statistical PRF information further enhances the performance of the QA4PRF framework. Extensive experiments over three real-world datasets demonstrate that the QA4PRF framework significantly outperforms all state-of-the-art PRF models.
For future work, we plan to investigate reinforcement learning solutions to perform multi-step query reformulation in (pseudo) relevance feedback scenarios. In addition, extending QA4PRF to query reformulation in sponsored search has the potential to improve platform revenue when the auction competitiveness of each candidate term is considered.
WEINAN ZHANG is now a tenure-track associate professor at Shanghai Jiao Tong University. His research interests include reinforcement learning, deep learning and data science with various real-world applications of recommender systems, search engines, text mining & generation, knowledge graphs, game AI etc. He has published over 80 research papers in international conferences and journals and has been serving as a (senior) PC member at ICML, NeurIPS, ICLR, KDD, AAAI, IJCAI, SIGIR etc. and a reviewer at JMLR, TOIS, TKDE, TIST etc.
YONG YU is a professor in the Department of Computer Science at Shanghai Jiao Tong University. His research interests include information systems, web search, data mining and machine learning. He has published over 200 papers and served as a PC member of several conferences including WWW, RecSys and a dozen other related conferences (e.g., NIPS, ICML, SIGIR, ISWC etc.) in these fields.