Evaluating the Effectiveness of Query-Document Clustering Using the QDSM Measure

It is well documented that queries submitted to Web search engines are, on average, rather short, which negatively impacts the engines’ performance as measured by precision. It is also well known that ambiguous keywords in a query make it hard to identify what exactly search engine users are looking for. One way to tackle this challenge is to consider the context in which the query is submitted, making use of query-sensitive similarity measures (QSSM). In this paper, a particular QSSM known as the query-document similarity measure (QDSM) is evaluated. QDSM is designed to determine the similarity between two queries based on their terms and their ranked lists of relevant documents. To this end, the F-measure and the nearest neighbor (NN) test have been employed to assess this approach over a collection of AOL query logs. Final results reveal that both the Average Link algorithm and Ward’s method yield better results using QDSM than using cosine similarity.


Introduction
This paper is an extension of work originally presented at the International Conference of the Chilean Computer Science Society (SCCC) [1]. Nowadays, a large amount of information is available on the Web; additionally, web search engines (WSEs) index thousands of pages daily, so finding relevant and timely information amid this unconstrained growth becomes quite a challenge. According to [2], this unrestricted growth has not been accompanied by corresponding technical advances in approaches to extract relevant information. Unsuccessful searches are common in WSEs and can be attributed to several causes, among which are the following. First, the queries submitted by users are mostly short (e.g., the average length of a web search is 2.4 words [3]). Because queries consist of only a few keywords, it is difficult to determine the specific topic to which a query belongs; therefore, the results for a query with few keywords can span unrelated topics. Furthermore, users often do not know how to formulate a correct query [4]. The problem becomes more complex when the user does not have a specific idea of the result that he/she is looking for (i.e., what constitutes relevant information for him/her). In light of the foregoing, it is not easy for WSEs to interpret the meaning of what users are looking for. One way to tackle this issue is to consider the context in which the queries are submitted to WSEs [5] [6]. To capture the context, WSEs should consider which queries are related to each other (i.e., determining whether the queries are similar) and how their results have been beneficial for users. A way to determine the relationship between similar queries and the relevant documents for these queries is given by the cluster hypothesis, which establishes that all documents considered relevant for a query are similar to each other (i.e., similar documents can be relevant for the same query) [7].
Accordingly, it can be assumed that documents relevant for a query $q$ are also relevant for a query $q'$, provided that $q'$ is similar to $q$. Thus, having methods that provide the similarity among documents and queries can yield a better characterization of the meaning of a new query and, as a consequence, more effective results.
If the WSE is able to establish how similar a new query is to recently submitted queries, then the search engine should provide documents that were relevant in previous searches. Hence, the recent past queries along with their relevant documents provide a context in which it is feasible to improve the answers to new queries. Aiming to capture the context in which queries are submitted, Tombros and Van Rijsbergen [9] present a pioneering approach that introduces the query-sensitive similarity measure (QSSM). This measure establishes the following: two documents are considered more similar to each other than other pairs if both are more similar to a given query. Following this line of argument, QSSMs can be used as a metric to measure the similarity between two queries considering the context. As such, a WSE using additional information can improve its effectiveness in answering a new query. To achieve this goal, queries alongside their documents should be stored in clusters. Currently, few approaches store the queries along with their documents [10][11][12][13][14] (these approaches recommend to the user a list of similar queries related to the query the user submitted). However, the approaches mentioned previously are not directly related to the approach presented in this paper.

Contribution
The main contribution of this paper is the effectiveness evaluation of QDSM. Roughly speaking, effectiveness is related to the quality of the retrieved documents. Better effectiveness occurs when more relevant documents are retrieved (i.e., among a total of N retrieved documents, there are more relevant than non-relevant documents). By contrast, worse effectiveness occurs when more non-relevant documents are retrieved. With this in mind, grouping similar queries alongside their relevant documents (in clusters) should directly affect the effectiveness. Therefore, improving the clusters' effectiveness implies having more relevant documents per cluster, which aligns with the cluster hypothesis.
To evaluate QDSM effectiveness, the F-measure and the nearest-neighbor (NN) cluster hypothesis test were used. Both tests were applied over the following algorithms: Single Link, Complete Link, Average Link, Bisection K-means, and Ward's method. Three relevance models were simulated in order to determine which documents are relevant for a specific query. In addition, a wide range of experiments was carried out over five sets of queries. Final results are presented as percentage improvements of QDSM over the cosine measure ($S_c$, also called cosine similarity), and are supported by the values obtained with the Student's paired t-test (two samples).
The remainder of this paper is organized as follows. Section 2 presents a detailed review of articles related to the problem of capturing the context using clustering. Section 3 describes the methodology. Section 4 presents the experimental environment. Section 5 displays the empirical results, which are then discussed in Section 6. Finally, Section 7 gives some perspectives on future work along with the conclusions.

Related Work
There have been many works that deal with the use of clustering in Information Retrieval (IR). Clustering in IR has been employed to improve effectiveness (i.e., the quality of clusters). Overall, clustering-based approaches that intend to capture the context in which queries are submitted can be classified into two categories according to the underlying repositories (also known as collections or datasets). The first category involves traditional IR datasets; some of these approaches, such as [15], [16], and [17], use QSSM similarities. The second category uses log-file data from search engines.
To improve effectiveness in the retrieval process, an approach that relies on hierarchic query-specific clustering is presented in [15]. In pursuing this goal, a wide range of experiments was performed. According to the authors, given a specific query, the hierarchy should be adapted to increase the likelihood of placing the documents relevant to the query in nearby clusters. Two characteristics stand out in this research. First, an analysis is given of how the optimal clusters vary with the number of top-ranked documents that allows better effectiveness. Second, a comparison between their results and inverted file search (IFS) is provided. Five traditional IR collections alongside four hierarchic agglomerative methods were employed in all experiments. Final results indicate that query-specific clustering outperforms static clustering in each of the experiments. On the other hand, a framework based on probabilistic co-relevance, which gives a query-sensitive similarity, is presented in [17]. The similarity between two documents corresponds to the ratio between the co-relevance probability and a specific query. Two cases were considered to identify the co-relevance. In the first, a document's relevance is independent of the rest of the documents. In the second, a document's relevance depends on the rest. Several experimental scenarios were studied using the nearest neighbor test on TREC collections. The final results reveal that the framework outperforms term-based similarity.
The approaches mentioned above expand the users' judgments based on the following assumption: all terms included in a document that is relevant for a specific query are relevant too. Consequently, it is assumed that all documents that include some of these terms are also relevant. Besides, these approaches do not deal with the similarity among queries, with the exception of [18] and [16]. In [18], a method called Scatter/Gather, which explores clusters based on documents, is proposed. The method returns a ranked list of titles for organizing and viewing retrieval results. Scatter/Gather is used as a tool for browsing retrieval results, which presents summaries to users. Towards that goal, documents are grouped by similar topics. A fractional algorithm provides k clusters on the TREC/Tipster dataset. As a result of experimentation, the authors assert that their method gives clusters tailored to the query's characteristics. In this way, their results corroborate the cluster hypothesis, since relevant documents are more similar to each other than to non-relevant documents. In [16], the authors introduce the Weighted Borda (WBorda) model, which determines the co-relevance of a document using different types of similarity. To this end, a Support Vector Machine (SVM) was trained to obtain the estimated co-relevance, fusing the induced rankings using several functions. Each function considers the similarity between documents and the query. Several similarity measures were considered in the experiments, such as cosine, BM25, M1, and M3. The final results in tasks such as nearest-neighbor clustering, cluster-based, and graph-based document retrieval indicate that WBorda provides better results than several proposed co-relevance models.
www.astesj.com https://dx.doi.org/10.25046/aj050201
On the other hand, approaches such as [19], [10], and [21] belong to the second category. A query clustering approach based on user logs is proposed in [19]. To that end, the documents previously read by users are employed to construct cross-references among documents and queries. According to the researchers, there exists a strong relationship between the selected documents and the queries. This approach rests on two fundamental principles: first, two queries are similar if users clicked on the same documents; second, if a set of documents was selected for the same queries, then the documents' terms are related to the queries' terms. The empirical results were obtained using the DBSCAN algorithm and the Encarta encyclopedia dataset. The final results show that many similar queries are gathered in the same clusters using this approach. A classification of query clustering, which compares various query similarity measures, is presented in [10]. Three groups are suggested in this classification: content-based approaches, feedback-based approaches, and results-based approaches. In content-based approaches, the common terms of queries are used to describe query clusters. Similarity functions such as Jaccard, cosine, and Dice were employed to build the clusters. In that regard, the authors claim that this method is not convenient for search engines because many queries have few terms. In feedback-based approaches, in turn, the similarity measure is grounded on users' selections over search results; therefore, two queries are similar if they lead to the selection of similar documents. Finally, results-based approaches evaluate the similarity between queries through the overlap of the returned documents. In this case, the researchers point out that the principal drawback of this approach is its high processing time. Notable results are obtained using the three approaches in parallel.
In [21], a WSE provides a user with a list of queries similar to the query the user submitted. Semantically similar queries support the clustering process. Clusters are formed taking into account the historical preferences of the users registered in the WSE. To build the clusters, a term-weight vector representation of queries that considers the clicked URLs was employed. The method exhibits two benefits: (1) it discovers the related queries, and (2) it sorts the queries according to a relevance criterion. It is important to mention that the queries are sorted using the following criteria: (a) the similarity between the clusters' queries and the new query and (b) the support, which is related to how much the query answers capture the user's interest. The experiments were conducted using the combination of (a) and (b). The results show improvements in average precision.
In summary, the first category is based on traditional IR datasets. A traditional IR dataset is formed by three sets: a set of documents (D), a set of queries (Q), and a set of users' judgments (JU). The users' judgments contain the relevant documents for each query in Q. Note that all works mentioned in this category include new relevant documents (if a document has some relevant term, then it is considered relevant), which are not part of the original JU. On the other hand, it should be noted that there are no JU in the second category (log files from search engines). Consequently, subject matter experts evaluate the pertinence of a document given a query. Note that all approaches mentioned in this related work modified the relevance of some documents, which directly impacts effectiveness. Contrary to these approaches, in this paper, three types of users' judgments are simulated without altering the documents' original relevance.
The overall procedure and a discussion about the results are presented in the following sections.

Methodology
The methodology overview is as follows. Initially, a user submits a query to the WSE. Thereafter, the WSE returns the documents that result from the query. These documents are ranked from the most similar to the least similar to the query. Once this is done, the documents are stored along with the query in clusters inside the WSE. QDSM is used to form these clusters, considering both documents and queries. In this way, when a user submits a new query, it is compared with the past queries (i.e., the queries previously stored) in the query-document clusters. Accordingly, an effectiveness improvement should occur because the clusters closest to the new query contain documents relevant to it. Document relevance thus becomes a crucial factor in enhancing effectiveness. In a traditional IR dataset, document relevance is given by subject matter experts, who determine which documents are relevant given a query. These documents are reflected in the users' judgments. On the other hand, the relevance of documents in a WSE is given by ranking functions. Overall, ranking functions attempt to capture the relevance through users' clicks on documents, which are expressed in the ItemRanks. In this manner, a retrieved document (i.e., a URL or web page) with a high ItemRank could be considered relevant.
It is essential to keep in mind that most clustering-based approaches extend the users' judgments or use subject matter experts to assign relevance to the documents, because none of these datasets has been designed to work with similar queries (past queries are part of the clusters). According to [22], a good way to tackle this problem is by using simulation. In this paper, document relevance has been simulated. Two notable advantages are obtained with the use of simulation. First, it is necessary neither to use subject matter experts nor to extend the users' judgments. Second, several models of relevance (a model can be seen as a ranking function) can be used; for instance, given a query, a document can be relevant or non-relevant depending on the model. In this paper, this is given by a relevance function, which determines the relevance of a document considering both its ItemRank and its relevance probability.
Aiming to shed light on how QDSM is evaluated using a relevance function, consider the following example. Assume that three documents ($d_5$, $d_{10}$, and $d_{12}$) have been retrieved for a query $q$, such that $d_5$ is the document most similar to the query. The respective ItemRanks of the documents are 20, 25, and 30. Likewise, the probabilities of relevance according to their respective ItemRanks are 90%, 40%, and 70%. Additionally, suppose a relevance function that only considers the last retrieved document (in this case, $d_{12}$). As $d_{12}$ has an ItemRank of 30, its probability of being relevant is 70%. To simulate the relevance probability of $d_{12}$, a binary array of 100 elements is used. Initially, this array is filled with 0 values; subsequently, 70 random positions, chosen using a uniform distribution, are set to 1. To assign the relevance of $d_{12}$, an array position is selected using a uniform distribution; if its value is 1, then $d_{12}$ is relevant; otherwise, $d_{12}$ is non-relevant (note that $d_5$ and $d_{10}$ are non-relevant). Following the same example, suppose a relevance function that assigns the relevance individually; then the same procedure is performed for each document ($d_5$, $d_{10}$, and $d_{12}$). Thus, a possible result could be that $d_5$ and $d_{12}$ are relevant, while $d_{10}$ is non-relevant. Finally, suppose a relevance function that provides the average relevance; then the average of the ItemRanks is obtained, and its relevance probability is used to assign the relevance of the three documents.
Note that different relevance functions could provide different QDSM results, because QDSM considers the relevant documents as part of its metric.
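The simulation procedure in the example above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names (`simulate_relevance`, `last_only`, `individual`, `averaged`) and the lookup table `PROB_BY_RANK` are hypothetical, the table holds just the three probabilities quoted in the example, and rounding the average ItemRank to the nearest table entry is a simplification.

```python
import random

ARRAY_SIZE = 100  # the example uses a binary array of 100 elements

# Hypothetical lookup: ItemRank -> relevance probability (values from the example).
PROB_BY_RANK = {20: 0.90, 25: 0.40, 30: 0.70}


def simulate_relevance(probability, rng):
    """Return 1 (relevant) or 0 (non-relevant) for a single document."""
    # Start from an all-zero array and set round(probability * 100)
    # uniformly chosen positions to 1.
    ones = round(probability * ARRAY_SIZE)
    array = [0] * ARRAY_SIZE
    for position in rng.sample(range(ARRAY_SIZE), ones):
        array[position] = 1
    # Pick one position uniformly; its value is the simulated relevance.
    return array[rng.randrange(ARRAY_SIZE)]


def last_only(item_ranks, rng):
    """Only the last retrieved document may be relevant."""
    flags = [0] * len(item_ranks)
    flags[-1] = simulate_relevance(PROB_BY_RANK[item_ranks[-1]], rng)
    return flags


def individual(item_ranks, rng):
    """Each document is simulated with the probability of its own ItemRank."""
    return [simulate_relevance(PROB_BY_RANK[rank], rng) for rank in item_ranks]


def averaged(item_ranks, rng):
    """All documents share the probability of the average ItemRank."""
    mean_rank = round(sum(item_ranks) / len(item_ranks))  # assumption: round
    probability = PROB_BY_RANK[mean_rank]
    return [simulate_relevance(probability, rng) for _ in item_ranks]


rng = random.Random(2006)          # fixed seed so runs are repeatable
ranks = [20, 25, 30]               # ItemRanks of d5, d10, d12
print(last_only(ranks, rng), individual(ranks, rng), averaged(ranks, rng))
```

Because each call draws random positions, repeated runs with different seeds give different relevance assignments, which is why the relevance function chosen can change the resulting QDSM values.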

The Query-Document Similarity Measure
The Query-Document Similarity Measure (QDSM) is a Query-Sensitive Similarity Measure (QSSM) whose fundamental purpose is to capture the semantic similarity between queries, taking into consideration the terms that belong to the queries as well as the positions in which the relevant documents appear in both result lists. Indirectly, the terms associated with the relevant documents should contribute to providing context. Specifically, each list of documents is presented in descending order of the documents' similarity to the query. From the semantic point of view, two queries are closer if they share more relevant documents in their lists. This can be appreciated by observing the number of relevant documents that form the intersection between the two lists. Therefore, the more relevant documents make up the intersection, the more similar the queries will be. Thus, this paper's primary assumption is that using similar queries alongside their relevant documents should provide clusters with better effectiveness than $S_c$, since additional information can be captured from the documents in addition to the queries (i.e., the information in each individual query is incomplete). Specifically, this additional information is given by the terms that belong to the union of the queries' and documents' terms but not to their intersection. Using this rationale, QDSM is in line with the cluster hypothesis, which claims that relevant documents for a particular query tend to be close to each other, whereby these relevant documents should tend to be in the same cluster for a specific query.
QDSM takes advantage of the position in which relevant documents appear in the list. As reported by [23], the documents most similar to the query tend to appear at the beginning of the list. On this basis, the order in which relevant documents appear in both lists gives information about the context (particularly the terms of the relevant documents). QDSM deals with the order of relevant documents through the use of the Longest Common Subsequence (LCS) algorithm. LCS yields the relative similarity while preserving the order in which a relevant document appears in both queries' lists. By doing so, it becomes possible to capture the context in which the queries are submitted.
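To illustrate the role of LCS, the following Python sketch computes the length of the longest common subsequence between two ranked lists after keeping only their relevant documents. It is a minimal sketch, not the implementation used in the experiments: the helper names and the two sample lists are hypothetical, and the textbook dynamic-programming formulation of LCS is assumed.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def relevant_only(ranked):
    """Keep the ids of relevant documents, preserving the rank order."""
    return [doc for doc, relevance in ranked if relevance == 1]


# Hypothetical ranked lists of (document id, relevance) pairs for two queries.
list_q1 = [("d2", 1), ("d7", 0), ("d5", 1)]
list_q2 = [("d7", 0), ("d2", 1), ("d9", 1)]

# d7 occurs in both lists but is non-relevant, so only d2 is matched.
shared = lcs_length(relevant_only(list_q1), relevant_only(list_q2))
print(shared)
```

Restricting the comparison to relevant documents is what makes the measure sensitive to the relevance model: a document shared by both lists contributes only if it is relevant in both.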
This measure is convenient in two situations:
• When the terms of a query are few, as currently happens in WSEs.
• In a dynamic environment, where the documents' relevance could change (i.e., the position of a document in the list could change as well as its ItemRank), so that non-relevant documents could become relevant documents.
In such situations, the queries are either short (i.e., they contain few keywords) or ambiguous. Nevertheless, they can be enriched with more information associated with their retrieved relevant documents.
Aiming to give formality, some definitions are detailed below.
Definition 1. Let $D$ be a set of documents, such that every document $d \in D$ is formed by a set of terms (i.e., the words contained in $d$). $D$ is stored in a WSE $W$. Besides, let $q$ be a single query, such that $q \in Q$, where $Q$ is a set of queries interpretable by $W$.
Definition 2. The cosine measure between $q$ and $d_i$ is defined as:
$$S_c(q, d_i) = \frac{\sum t_q \cdot t_i}{\sqrt{\sum t_q^2}\,\sqrt{\sum t_i^2}}$$
such that $d_i \in D \wedge q \in Q$, $t_q$ are the query's terms and $t_i$ are the document's terms.
Definition 3. Let $|d_i|$ be the number of terms in a document $d_i$, such that $d_i \in D$. Note that this definition can also be applied to obtain the number of terms of a query $q$.
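As an illustration of the cosine measure in Definition 2, the sketch below computes it between two bags of terms. It assumes the standard cosine formulation with raw term counts as weights; the weighting scheme and the function name `cosine` are assumptions made for this example only.

```python
import math
from collections import Counter


def cosine(terms_a, terms_b):
    """Cosine measure between two bags of terms (raw counts as weights)."""
    weights_a, weights_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both bags contribute to the dot product.
    dot = sum(weights_a[t] * weights_b[t]
              for t in weights_a.keys() & weights_b.keys())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag shares no terms with anything
    return dot / (norm_a * norm_b)


query = ["jaguar", "speed"]
document = ["jaguar", "car", "speed", "engine"]
print(cosine(query, document))
```

When the two bags share no terms the dot product is zero, so the measure is 0; it approaches 1 as the bags become identical.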
Then, there are no terms in common between q and d i . Hence, applying the Def- 0 ∀d i ∈ D} be the set of documents whose similarity with q ∈ Q is different to 0.
Definition 5. Let $L_N(q)$ be a list of $N$ documents retrieved from $W$, such that $L_N(q)$ is ranked in decreasing order (i.e., the documents are ordered from highest to lowest according to $S_c$).
(1) where $M$ corresponds to the relevance model (i.e., PartialRel, AvRel or LastRel), ItemRank is a function that provides the rank of the document $d_i$, and $\mathcal{M}$ is the function that gives the probability considering $M$ and ItemRank. A binary array of 100 elements is used to represent the probability given by $\mathcal{M}$. A uniform distribution is employed both to instantiate the values (the percentage is represented by the positions set to 1 in the array) and to determine the relevance (if the selected position of the array contains a 1, then the document is relevant).
Definition 7. Let $L_{N,R}(q)$ be a list of retrieved documents along with their relevances, then:
Definition 8. Given two queries $q$ and $q'$, both in $Q$, and their corresponding lists of documents $L_{N,R}(q)$ and $L_{N',R}(q')$, QDSM is defined as follows:
where:
• $S_c(q, q')$ corresponds to the cosine measure between the queries $q$ and $q'$.
• $LCS(L_{N,R}(q), L_{N',R}(q'))$ is the LCS algorithm applied over $L_{N,R}(q)$ and $L_{N',R}(q')$ [24].
• $\max$ gives the greatest number of relevant documents between the lists $L_{N,R}(q)$ and $L_{N',R}(q')$.
Lemma 2 asserts that there exists at least one common document in both lists, and therefore there is at least one term in common among the queries and the document.
Proof. Suppose $|d| > |q|$, which implies that $d$ has at least one term more than $q$. On the other hand, it can occur that $|q| > |d|$, in which case $q$ has at least one term more than $d$. Thus, $\exists t \in ((d - q) \cup (q - d))$. Therefore, $\exists t \in (d \,\triangle\, q)$.
Lemma 3 points out that if the document and the query are formed by different numbers of terms, then there is at least one term that does not belong to the intersection between them. Note that $d$ and $q$ are in $L_{N,R}(q)$. Proof.

Definition 9. Let $RQ$ be a set of queries, along with their retrieved documents and their corresponding relevances, then $RQ$ is defined as follows:
Definition 10. Let $D$ be a benchmark query set, which is formed by $Z$ subsets of queries, such that:
Figure 1: QDSM measure
Theorem 2 ensures that the context in which the queries are submitted to the WSE can be captured by the terms complementary to both queries and their relevant documents (i.e., the symmetric difference of the sets $d$, $q$ and $q'$: $((d \,\triangle\, q) \,\triangle\, q')$). An example of the essence of Theorem 2 and of how QDSM is computed is displayed in Figure 1. In Figure 1, the terms common to both queries are presented in bold (i.e., $t_2$ and $t_3$); note that both terms also appear in $d_2$, which is a relevant document (i.e., all documents marked $(d_i, 1)$ are relevant, whereas those marked $(d_i, 0)$ are non-relevant). LCS is applied over both lists of retrieved documents considering only the relevant documents in the lists (i.e., even though $d_7$ is in both lists, only $d_2$ is considered). Finally, all terms that give context are shown in blue (i.e., $t_1$, $t_4$, $t_{10}$ and $t_{15}$).
It is worth noting that QDSM takes the value 1 (see Definition 8) when $q$ and $q'$ are the same query and all retrieved documents are relevant. Specifically, this can be broken down into two parts. In the first part, $S_c(q, q')$ yields 1 because $q$ and $q'$ are the same. In the second part, both result lists are equal, and all retrieved documents are relevant; therefore, both lists hold the relevant documents in the same positions.
In summary, QDSM provides a metric that captures the semantic relationship between two queries (context), considering the order of the relevant documents in both lists. A wide range of experiments was conducted in order to compare the effectiveness of $S_c$ and QDSM. The experimental setup is described in the following section.

Experimental Environment
A benchmark query set extracted from the well-known query log dataset "AOL Query Logs Dataset" (AOL) [25] was used to carry out the experiments. This collection stores more than 20 million web queries, submitted by around 650 thousand users, in more than 36 million lines of data. These queries were collected over an interval of three months in 2006. Broadly speaking, queries in AOL are stored as rows in the database files, which contain five columns with the following fields: {AnonID, Query, QueryTime, ItemRank, ClickURL}, where:
• AnonID: an anonymous user ID number.
• Query: the query submitted by the user to the WSE.
• QueryTime: the exact time at which the user submitted the query.
• ItemRank: if the user clicked on a result, the rank of the selected document; empty otherwise.
• ClickURL: if the user clicked on a search result, the domain portion of the URL of that result.

The Benchmark set of Related Queries
To carry out the clustering experiments, a benchmark set of related queries (RQ) (see Definition 9) was randomly drawn from AOL. To verify that the queries were partially related, each time a query was chosen, it was checked that there existed at least one other query whose $S_c$ with it was neither one nor zero. To achieve this goal, the queries with empty ClickURLs were removed because they have no answers associated with them. Furthermore, stop-words were removed before applying $S_c$. The core insight is that ClickURLs allow depicting a list of retrieved documents for $q$ (see Definition 5). It should be noted that registers with the same query $q$ (i.e., the same terms), logged by the same user around the same time, correspond to a single query that was split into several registers. Providing the maximum amount of information implies using the longest session, which contains at least one register (i.e., at least one result or document).
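The preprocessing described above can be sketched as follows. This is a simplified illustration under stated assumptions: the log is taken to be tab-separated with a header row naming the five fields, the sample rows are invented, the function name `clicked_lists` is hypothetical, and registers are grouped only by user and query text (the "around the same time" condition and the choice of the longest session are omitted for brevity).

```python
import csv
from collections import defaultdict

FIELDS = ["AnonID", "Query", "QueryTime", "ItemRank", "ClickURL"]


def clicked_lists(lines):
    """Map (AnonID, Query) to its list of (ItemRank, ClickURL) pairs."""
    reader = csv.DictReader(lines, fieldnames=FIELDS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    grouped = defaultdict(list)
    for row in reader:
        if row["AnonID"] == "AnonID":
            continue  # skip the header row
        if not row["ClickURL"]:
            continue  # registers without a click have no associated answer
        key = (row["AnonID"], row["Query"])
        grouped[key].append((int(row["ItemRank"]), row["ClickURL"]))
    return dict(grouped)


# Invented rows in the assumed layout, for illustration only.
sample = [
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL",
    "1\tjaguar car\t2006-03-01 10:00:00\t\t",
    "1\tjaguar car\t2006-03-01 10:00:05\t2\thttp://a.example",
    "1\tjaguar car\t2006-03-01 10:00:40\t5\thttp://b.example",
    "2\tpuma\t2006-03-02 09:00:00\t1\thttp://c.example",
]
lists = clicked_lists(sample)
print(lists)
```

Each resulting list plays the role of the retrieved documents for a query; registers that share user and query text are merged into one entry, mirroring the observation that a single query is split across several registers.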
On the other side, it is important to highlight that AOL does not possess users' judgments. Note that the users' judgments play a fundamental role in order to know what documents are relevant for a specific query [26]. Furthermore, these relevant documents are necessary to evaluate precision, recall, and, therefore, effectiveness.
To tackle this issue, users' judgments were simulated following the approach presented by [27]. Simulations of relevance judgments are presented in the following section.

Simulation of Relevance Judgments
Simulating document relevance regarding a query is not a trivial task. This task embraces a great variety of aspects, such as users' literacy [28], the information needed at a given point in time, and the user's profile [29], among others. To address this problem, the approach proposed by [27], which provides the relevance probabilities for documents depending on their ItemRanks in AOL, is applied in this paper. To that end, $F(ItemRank(d_i), M)$ (see Definition 6) is simulated using $\mathcal{M}(M, ItemRank(d_i))$ in Definition 7. In simple words, the relevance is assigned as a value 0 (non-relevant) or 1 (relevant), which is obtained considering the values presented in Table I (i.e., $M$ in $F(ItemRank(d_i), M)$). The relevance probabilities were calculated using the ItemRanks, assuming that user clicks provide information about how the user interprets the query [30].
In Table I, two relevance models are presented by "AllRel" and "LastRel" columns. "AllRel" implies all clicked documents are considered as relevant; meantime "LastRel" reports that only the last clicked document is relevant. Regarding "AllRel", two variants were used for it. The first variant is named "PartialRel", which considers the individual ItemRank of each document obtained from average is 30. Subsequently, the relevance probability is determined by the "ItemRank" (average) row and the "AllRel" column, so for this example, the relevance probability for each document is 0.5106. Although the three documents have the same relevance probability, the relevance for each document is individually obtained.

Clustering Experiments
Five algorithms were evaluated considering the three relevance models over the same D. Five sets of RQ were used; the smallest set of RQ contains 123 queries alongside their documents (i.e., the relevance has been assigned for each document), while the biggest set comprises 2,141 queries. To compare the clusters' quality between $S_c$ and QDSM, two well-known measures have been used: the F-measure and the nearest neighbor (NN) cluster hypothesis test. The F-measure was proposed by [31]; the idea behind this measure is to evaluate effectiveness in the post-processing step, in which each cluster is assigned to a class. The F-measure can be seen as a way of combining the precision and recall of a specific retrieval model, and it is defined as the harmonic mean of the model's precision and recall. In simple words, the purpose of the F-measure is to score a binary classification (positive or negative) according to whether objects belong to given classes in the clusters. The F-measure allows giving more importance to precision, to recall, or equal importance to both. On the other hand, the nearest neighbor (NN) cluster test (also known as the NN test) was proposed by Voorhees ([32], [33]). In simple terms, the NN test reviews each of the documents retrieved for a specific query, identifying how many of its n closest neighbors are relevant. Nearest-neighbor methods are also used as a non-parametric classification and regression technique.
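For reference, the weighted harmonic mean behind the F-measure can be sketched as below. The parameter name `beta` follows the common convention for the weighting factor and is an assumption of this sketch rather than notation taken from the paper; `beta = 1` gives precision and recall the same weight.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # both values are zero (or recall is zero with beta = 0)
    return (1 + b2) * precision * recall / denominator


# Equal weight: the plain harmonic mean of precision and recall.
print(f_measure(0.75, 0.60))
# beta > 1 gives more importance to recall, beta < 1 to precision.
print(f_measure(0.75, 0.60, beta=2.0))
```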
Turning to the cluster hypothesis, QDSM should provide better effectiveness than $S_c$ if it is possible to find more relevant documents per cluster. Each experiment was executed ten times, and results are displayed as percentage increases or decreases of QDSM relative to $S_c$. Specifically, the F-measure was used giving the same weight to precision and recall, while the NN test was instantiated with n = 3 in all experiments. In addition, the Student's paired t-test (two samples) was used to support the results. Five algorithms (Single Link, Complete Link, Average Link, Bisection K-means, and Ward's method) were used in each experiment.
All experiments were carried out on a server with an Intel Xeon E3-1220 processor at 3.00 GHz, 16 GB of 2133 MHz RAM, a 1 TB 7200 RPM hard drive, and the Debian Jessie 8.4 Linux operating system.
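The counting step of the NN test with n = 3 can be sketched as follows. This is an illustrative reading of the description given in this paper, not the exact procedure of [32], [33]: the similarity values, document ids, and helper names are hypothetical, and every retrieved document is examined.

```python
from collections import defaultdict


def symmetric(pairs):
    """Build a symmetric similarity lookup from {(a, b): value}."""
    sim = defaultdict(dict)
    for (a, b), value in pairs.items():
        sim[a][b] = value
        sim[b][a] = value
    return sim


def nn_counts(sim, relevant, n=3):
    """For each document, count the relevant ones among its n nearest."""
    counts = {}
    for doc in sim:
        # Rank the other documents by decreasing similarity to `doc`.
        neighbours = sorted(sim[doc], key=lambda other: sim[doc][other],
                            reverse=True)[:n]
        counts[doc] = sum(1 for other in neighbours if other in relevant)
    return counts


# Hypothetical pairwise similarities between five retrieved documents.
sim = symmetric({
    ("d1", "d2"): 0.90, ("d1", "d3"): 0.80, ("d1", "d4"): 0.30,
    ("d1", "d5"): 0.10, ("d2", "d3"): 0.70, ("d2", "d4"): 0.20,
    ("d2", "d5"): 0.40, ("d3", "d4"): 0.60, ("d3", "d5"): 0.50,
    ("d4", "d5"): 0.65,
})
relevant = {"d1", "d2", "d3"}
print(nn_counts(sim, relevant, n=3))
```

Higher counts for relevant documents indicate that relevant documents sit close to each other under the similarity measure being tested, which is the behavior the cluster hypothesis predicts.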

Experimental Results
In this section, the quality of the clusters (effectiveness) produced by QDSM and S_C is compared. To this end, the F-measure and the (NN) test were applied to the Single Link, Complete Link, Average Link, Bisection K-means, and Ward algorithms. Effectiveness was obtained using the relevance models PartialRel, AvRel, and LastRel. Note that the number of documents considered in the (NN) test is 3; that is, the relevance of the three closest documents was evaluated with respect to the query. To do so, the similarities between documents are checked alongside their relevance to the query. All results presented in the tables correspond to QDSM and are expressed as percentages relative to S_C. Table 2 reports the effectiveness of QDSM over the three relevance models using the Single Link algorithm. Note that the three relevance models were tested on five sets of queries (# of q). This table shows no improvement of QDSM over S_C. Indeed, Single Link is more effective with S_C than with QDSM; this is consistent with the p-value obtained with the Student's paired t-test (two samples), which was 0.0029 for this table. Generally speaking, Single Link exhibits its best results with the "AvRel" relevance model, followed by "LastRel" and finally "PartialRel". Continuing the same trend, the results for the F-measure are shown in Table 3. Here S_C again shows better results than QDSM, in line with the p-value of 0.00036, and notably the best results are obtained with the "AvRel" relevance model.
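The statistic behind a paired t-test can be sketched as follows. This is a minimal illustration under stated assumptions: the two samples are taken to be paired effectiveness scores (for example, one pair per query set), the data in the usage example are hypothetical, and only the t statistic and the degrees of freedom are computed. Turning them into a p-value requires the Student's t distribution, for which a statistics library would normally be used.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float],
                       b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    if len(a) < 2:
        raise ValueError("at least two pairs are required")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are equal")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical paired scores, not values from the tables.
    scores_a = [0.61, 0.58, 0.64, 0.70, 0.66]
    scores_b = [0.55, 0.57, 0.60, 0.62, 0.65]
    print(paired_t_statistic(scores_a, scores_b))
```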
Regarding the Complete Link algorithm, Tables 4 and 5 present results similar to those of the Single Link algorithm, with S_C outperforming QDSM. In Table 4, the best results under the (NN) test are provided by the "LastRel" relevance model, whereas the best results under the F-measure are given by the "PartialRel" model. The p-value for the (NN) test was 0.720, whilst the p-value for the F-measure was 0.00003.
On the other hand, the Average Link algorithm behaves differently from the Single Link and Complete Link algorithms: here QDSM is better than S_C. In Table 6, the best results are given by "PartialRel", followed by "LastRel" and "AvRel", respectively. The results shown in Table 6 are in line with the p-value of 0.00041. Following the same trend, in Table 7 QDSM presents better results than S_C for the F-measure, in accordance with the p-value of 0.000002. It should be noted that there is no substantial difference between "AvRel" and "LastRel".
In turn, the results for the Bisection K-means algorithm are shown in Tables 8 and 9. The two tables provide conflicting results: in Table 8, two relevance models ("PartialRel" and "AvRel") give good results for QDSM, whereas the same relevance models give opposing results in Table 9. These latter results are consistent with their respective p-values. The p-value for [...] To sum up, considering both measures, the three relevance models, and the p-values obtained, the best results are provided by the Average Link algorithm and Ward's method.

Discussion
Traditional IR datasets were not used, mainly because they are built from autonomous queries. As mentioned earlier, a traditional IR dataset is made up of a set of documents D, a set of queries Q, and a set of users' judgments JU. Strictly speaking, evaluating effectiveness with similar queries (the queries that form part of the clusters) requires a set of similar queries (Q') for Q. Note that Q' must have its own set of users' judgments (JU'). Indeed, the effectiveness of two similar queries should be different. Accordingly, these datasets are not suitable for evaluating approaches based on similar queries because they have neither Q' nor JU'. On the other hand, approaches based on log files employ subject matter experts to extend the relevance judgments and to decide whether a document is relevant or non-relevant for a given query. In short, extending the set of relevant documents is common to both types of collections. Therefore, to avoid using subject matter experts or extending document relevance through relevant terms, three relevance models were simulated in all experiments.
Concerning the effectiveness evaluation for both measures (S_C and QDSM), it is noteworthy that the F-measure has been widely used in approaches that deal with post-retrieval clustering. Nevertheless, this measure has two drawbacks. First, its formula depends, through precision and recall, on the number of relevant and non-relevant documents. Thus, the initially measured effectiveness changes once new documents are considered relevant in the post-retrieval process. Second, it does not reflect how the clusters are composed with respect to the different classes of objects they contain. Objects can belong to predetermined classes, and the ideal situation arises when each cluster is formed only by objects of the same class. Two well-known terms in cluster evaluation capture this situation: homogeneity and completeness. Homogeneity requires that each cluster contain few classes, whereas completeness requires that each class be contained in few clusters. Thus, two clusterings formed from the same objects and the same classes can have the same F-measure while their homogeneity and completeness differ. Like the F-measure, the (NN) test has been extensively used to assess effectiveness. However, this test is not sensitive to homogeneity and completeness, since it relies on a direct search of the n nearest neighbors. Hence, it is more appropriate for corroborating the cluster hypothesis, which concerns the relevant documents that form the clusters.
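Homogeneity and completeness can be quantified in several ways. The sketch below assumes the entropy-based definitions commonly used in cluster evaluation (the paper itself gives no formulas), and the function names are illustrative. Homogeneity equals 1 when every cluster contains a single class, and completeness equals 1 when every class falls into a single cluster.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def _entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a labelling."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(x: Sequence[Hashable],
                         given: Sequence[Hashable]) -> float:
    """Conditional entropy H(x | given), in nats."""
    n = len(x)
    joint = Counter(zip(x, given))
    marginal = Counter(given)
    return -sum(c / n * math.log(c / marginal[g])
                for (_, g), c in joint.items())


def homogeneity(classes: Sequence[Hashable],
                clusters: Sequence[Hashable]) -> float:
    """1 - H(class | cluster) / H(class); defined as 1 if H(class) is 0."""
    h = _entropy(classes)
    return 1.0 if h == 0 else 1.0 - _conditional_entropy(classes, clusters) / h


def completeness(classes: Sequence[Hashable],
                 clusters: Sequence[Hashable]) -> float:
    """1 - H(cluster | class) / H(cluster): homogeneity with roles swapped."""
    return homogeneity(clusters, classes)


if __name__ == "__main__":
    # Hypothetical labels: two relevant and two non-relevant documents.
    classes = ["rel", "rel", "non", "non"]
    for clusters in ([0, 0, 1, 1], [0, 1, 2, 3], [0, 0, 0, 0]):
        print(clusters,
              homogeneity(classes, clusters),
              completeness(classes, clusters))
```

In this example, placing every document in its own cluster keeps homogeneity at its maximum but lowers completeness, whereas merging everything into one cluster does the opposite.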
Regarding the results presented in section "Experimental Results", it is essential to point out that there is no significant difference between the values provided by the F-measure and the (NN) test, except for the Complete Link algorithm (Tables 4 and 5). In particular, for "AvRel", Table 4 ((NN) test) presents favorable results for QDSM, in contrast to Table 5. Besides, the p-values for the two tables differ. On the other hand, there is a substantial difference with respect to some results presented in [1], in particular regarding the relevance models used in that research. There, the relevance model "AllRel" is used as proposed by [27], whereas in this research "AllRel" has been replaced by "PartialRel" and "AvRel"; that is, not all documents are considered relevant, as they are in [1]. It is important to point out that, in the real world, it is unlikely that all retrieved documents are relevant. Nevertheless, the Average Link algorithm presents interesting results in both works. Concerning the algorithms evaluated in this research, the best results are provided by the Average Link and Ward algorithms under both tests (the F-measure and the (NN) test). The main characteristic of Ward's method is that it minimizes the variance of the objects belonging to a particular cluster using the "error sum of squares". In this way, each cluster should tend to contain objects of few classes (relevant and non-relevant). Carried over to the context of the cluster hypothesis, there should be a clear separation between clusters of relevant documents and clusters of non-relevant documents. Therefore, the nearest neighbor of a relevant document should be relevant too. On the other hand, in the Average Link algorithm the distance (S_C or QDSM) between two clusters is the average distance between each object in one cluster and every object in the other cluster, which avoids extreme values and yields more homogeneous clusters. This contrasts with the way clusters are built in the Single Link and Complete Link algorithms. Finally, Bisection K-means is a hybrid approach that combines partitional (K-means) and hierarchical clustering. This algorithm exhibits favorable results in the (NN) test except when the relevance model is "LastRel"; recall that in this case only the last retrieved document could be relevant.
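The difference between the three linkage criteria discussed above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function linkage_distance and its parameters are hypothetical, and dist stands for any pairwise distance (derived from S_C or from QDSM). Single Link takes the smallest pairwise distance, Complete Link the largest, and Average Link the mean over all cross-cluster pairs.

```python
from itertools import product
from statistics import mean
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def linkage_distance(cluster_a: Sequence[T],
                     cluster_b: Sequence[T],
                     dist: Callable[[T, T], float],
                     method: str = "average") -> float:
    """Distance between two non-empty clusters under a linkage criterion."""
    if not cluster_a or not cluster_b:
        raise ValueError("both clusters must be non-empty")
    # All distances between an object of one cluster and one of the other.
    pairwise = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return mean(pairwise)
    raise ValueError(f"unknown linkage method: {method!r}")


if __name__ == "__main__":
    # Hypothetical one-dimensional objects; the distance is |x - y|.
    a, b = [0.0, 1.0], [4.0, 10.0]
    for method in ("single", "complete", "average"):
        print(method, linkage_distance(a, b, lambda x, y: abs(x - y), method))
```

In this example the outlying object in the second cluster determines the Complete Link distance and is ignored by Single Link, while Average Link lies between the two extremes.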
Although running times are beyond the scope of this paper, it is worth noting that most of the time complexities are not high. To obtain the time complexities, it is necessary to account for traversing a distance matrix (one for S_C and one for QDSM) in order to find the n nearest neighbors. Furthermore, calculating the LCS implies traversing another matrix with M rows and N columns. Note that M corresponds to the length of the list of retrieved documents for a query q, whilst N is the length of the list of retrieved documents for a query q'. Therefore, the LCS takes O(MN). Recall that the LCS is used to evaluate QDSM. According to [34], the optimal implementation of Ward's method, based on the nearest neighbor chain and reciprocal nearest neighbor algorithms, takes O(N^2). In turn, the time complexity of the Bisection K-means algorithm is O(N^2 log_2 N). The Single Link algorithm takes O(N^2) [35], Complete Link takes O(N^2 log_2 N) [36], and Average Link takes O(N^2 log_2 N) [37]. It is important to mention that both distance matrices are built before each clustering algorithm is run.
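The O(MN) cost of the LCS follows directly from the standard dynamic-programming formulation, sketched below for two ranked lists of document identifiers. The function name lcs_length and the identifiers in the example are hypothetical, and the way QDSM combines the LCS with the query terms is not reproduced here.

```python
from typing import Hashable, Sequence


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence of a and b.

    Fills an (M + 1) x (N + 1) table once, hence O(MN) time and space,
    where M = len(a) and N = len(b).
    """
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # Matching items extend the best subsequence of the prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last item of one list or the other.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


if __name__ == "__main__":
    # Hypothetical ranked result lists for two queries q and q'.
    results_q = ["d1", "d2", "d3", "d4"]
    results_q_prime = ["d2", "d4", "d5"]
    print(lcs_length(results_q, results_q_prime))
```

Because the order of the items matters, two result lists that share documents in a different ranking order obtain a shorter LCS than lists that preserve the ranking.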

Conclusion and Future Work
This paper set out to assess the quality of the clusters (effectiveness) built using the Query-Document Similarity Measure (QDSM). To achieve this goal, the F-measure and the nearest neighbor (NN) test were used to evaluate cluster quality. The clusters of documents were built using the AOL Query Logs dataset. To assign relevance to the documents, three variants based on the ItemRanks of the retrieved documents were simulated. Extensive experimentation was carried out using the Single Link, Complete Link, Average Link, Bisection K-means, and Ward algorithms. According to the results of the nearest neighbor (NN) test, QDSM presents significant results with the Average Link, Ward, and Bisection K-means algorithms. According to the results of the F-measure, the Ward and Average Link algorithms provide better results using QDSM than using cosine similarity (S_C). Considering the three variants of relevance, the best results with QDSM are provided by the Average Link algorithm, followed by Ward's method. Future research includes comparing QDSM with other state-of-the-art measures.

Conflict of Interest
The authors declare no conflict of interest.