Applying Similarity Measures to Improve Query Expansion

The rapid evolution of information technologies, especially in the last few decades, has produced an increase in the volume of data on the World Wide Web, which is still growing significantly. Retrieving relevant information from the Internet, or from any data source, with a query of only a few words has become a major challenge. To overcome this, query expansion (QE) plays an important role in improving information retrieval (IR): the user's original query is rewritten as a new query by appending related terms of the same importance. One of the problems of query expansion is choosing suitable terms. This problem leads to another challenge, namely how to retrieve the important documents with high precision, high recall, and high F-measure. In this paper, we address this problem by applying different similarity measures together with the English WordNet. The obtained results show that, with a suitable selection method, the English WordNet can be used to improve retrieval effectiveness. The proposed work extracts the terms from all the documents and the query and then applies the following steps: preprocessing, expanding the query based on the English WordNet, selecting the best terms, weighting the terms, and finally using cosine similarity and Jaccard similarity to obtain the relevant documents. The experiments were carried out on the DUC2002 dataset, which contains 559 documents distributed over several categories. For random queries, the average precision was 100% with cosine similarity and 84.4% with Jaccard similarity, the average recall was 86.8% with cosine and 73.4% with Jaccard, and the average F-measure was 92% with cosine and 76% with Jaccard.


Introduction
Information Retrieval (IR) deals with the retrieval and display of information of interest. The user reaches the desired information through the information retrieval system. Usually, the user's information need is represented by means of a query. Many challenges may face the IR system, one of which is the problem of vocabulary mismatch [1]. To treat this problem, Automatic Query Expansion (AQE) was proposed by researchers in the IR field. The goal of this technique is to rewrite the original query by appending new terms to it in order to obtain better results. Cui et al. classified the AQE techniques into two main classes: global analysis and local analysis [2]. The techniques of the global analysis class are independent of the main query and its results. In general, they use external knowledge sources, such as WordNet or a thesaurus, to choose terms for expansion, whereas the local analysis class creates a new query depending on documents retrieved by a previous search, for example through relevance feedback [3]. New terms can be appended to the main query before either the primary search or the relevance-feedback search [4]. The IR system consists of three elements [5], namely the documentary database, the query subsystem, and the matching mechanism.
Improving the effectiveness of an information retrieval system depends on applying certain techniques to it. One of these techniques is query expansion [6]. The growth of the data available on the Web has not been accompanied by techniques for retrieving the relevant data [7]. A Web search often fails to produce relevant results for four reasons. First, the words typed by the user into the search engine may belong to several topics, so the search results are not clearly focused. Second, the shortness of the query may cause ambiguity about what the user wants [8]. Third, the user may not know exactly what he/she is searching for. Fourth, some users are not able to formulate a suitable query [9].

Related Works
In 2006, Radwan et al. introduced a new fitness function and compared its results with those of a genetic algorithm based on classical IR with a cosine fitness function for the problem of query learning. Their function was applied to CISI, CACM, and NPL. These three well-known test collections were used to obtain a complete view of improving IR systems using genetic techniques [5]. In 2014, two methods of query expansion were proposed by Brandao. The first is an unsupervised entity-oriented query expansion method, which chooses expansion terms using taxonomic features induced by the semantic structure. The second includes machine learning techniques to choose and rank the entities used for query expansion [10]. In 2014, Jain et al. suggested a technique that investigates the role of graph structure in query expansion and determines the significance of each node in the graph using WordNet. The most important nodes, which represent word senses, were identified and appended to the original query [11]. In 2015, a method of query expansion for short queries on the Web was proposed by El Ghali et al. This method used the Latent Semantic Analysis (LSA) technique, which depends on the context of the query. Three methods of query suggestion were used to extract the context from the search engine, namely cosine similarity, language models, and their fusion [12]. In 2018, Jabri et al. suggested a similarity measure using the query graph. This measure calculates the similarity between the candidate terms and the initial query, drawing on text mining techniques and the explicit semantic analysis (ESA) measure [13]. The work proposed in this paper extracts the terms from all the documents and the query and then applies the following steps: preprocessing, query expansion based on the English WordNet, selecting the best terms, term weighting, and finally using cosine similarity and Jaccard similarity to obtain the relevant documents.

Hussain, Iraqi Journal of Science, 2021, Vol. 62, No. 6, pp: 2053-2063

Basic Concepts of Query Expansion System
The query expansion system consists of several main steps, which are described in the following sections.

Preprocessing
Preprocessing is a language-dependent process. The main function of this step is to extract from the data set the character sequences that can be used to expand the user's original query and to perform tokenization and linguistic preprocessing on them; the same steps are applied to the user's query.
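
As an illustration of this step, the following minimal Python sketch tokenizes a text, removes stop words, and stems the remaining tokens. It is not the implementation used in this work: the stop-word list, the suffix list, and the function names are assumptions made for the example, and the toy suffix stripper only stands in for a real stemmer such as Porter's.

```python
import re

# Illustrative stop-word list (an assumption; the actual list is not given in the text).
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "to", "is", "are"}

# Toy suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]
```

The same `preprocess` function would be applied to every document and to the query, so that both are reduced to comparable lists of terms.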

Query Expansion
This step aims to append additional related tokens to the main query in order to improve the effectiveness of IR systems [6]. QE plays an effective role in refining information retrieval (IR): the main query is rewritten as a new query by appending new related terms of the same importance. There are several types of query expansion techniques, as summarized in Figure 1.

Figure 1-Types of Query Expansion
One of these techniques is based on WordNet, which is the one used in the proposed work. WordNet is a lexical database available for several languages. Equivalent terms from different languages are linked through synsets (sets of synonymous senses). WordNet is used to obtain the equivalent terms that match the user's information need; hence, the synonyms of the query terms are added to the query. Voorhees et al. [14] used WordNet for query expansion by appending equivalent words to the query and reported negative results. They noticed that this method makes little difference in retrieval effectiveness if the main query is already well formed. Smeaton et al. also used WordNet, along with part-of-speech (POS) tagging, for QE. The interesting point in their work is that the terms of the original query were discarded after the expansion process [15].
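
The idea of synonym-based expansion can be sketched in Python as follows. To keep the example self-contained, a small hand-written dictionary (`TOY_SYNSETS`, a hypothetical stand-in) replaces WordNet; in practice the synonyms would come from a WordNet interface. The function name and the `(term, is_original)` representation are assumptions made for illustration.

```python
# Hypothetical miniature thesaurus standing in for WordNet synsets.
TOY_SYNSETS = {
    "earthquake": ["quake", "temblor", "seism"],
    "car": ["auto", "automobile", "machine"],
}


def expand_query(terms, synsets=TOY_SYNSETS):
    """Return (term, is_original) pairs: original terms first, then new synonyms."""
    expanded = [(term, True) for term in terms]
    seen = set(terms)
    for term in terms:
        for synonym in synsets.get(term, []):
            if synonym not in seen:  # avoid adding the same synonym twice
                seen.add(synonym)
                expanded.append((synonym, False))
    return expanded
```

Keeping track of which terms are original and which are synonyms makes it possible to treat the two groups differently in later steps.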

Selection of the Best Terms of Query
In this step, the best terms of the query are chosen, because query expansion produces a large number of expansion tokens, and many of these tokens are not actually important. Normally, only a few expansion tokens are chosen, since the effectiveness of the IR system improves when the expansion tokens are few. The selection depends on whether or not a term exists in the documents: if an original term exists in the documents, it is assigned a weight of 1, and if a synonym exists, it is assigned a weight of 0.5.
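
A minimal Python sketch of this selection rule is given below. The function name, the `(term, is_original)` input format, and the representation of documents as token lists are assumptions made for the example; only the weights 1 and 0.5 and the presence test come from the description above.

```python
def select_terms(expanded, documents):
    """Keep only the query terms that occur in at least one document.

    expanded:  list of (term, is_original) pairs
    documents: list of token lists
    Returns a dict mapping each kept term to its assumed weight
    (1.0 for original terms, 0.5 for synonyms).
    """
    vocabulary = set()
    for tokens in documents:
        vocabulary.update(tokens)

    weights = {}
    for term, is_original in expanded:
        if term in vocabulary:
            weight = 1.0 if is_original else 0.5
            # If a term appears both as original and as synonym, keep the higher weight.
            weights[term] = max(weights.get(term, 0.0), weight)
    return weights
```

Terms that never occur in the collection cannot contribute to any match, so discarding them shrinks the expanded query without changing which documents can be found.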

Weighting and Ranking of Query Terms
In this step, the rank and weight of each query expansion token are calculated. The input is the set of best query terms selected in the previous step. The weight of a token reflects its relevance in the expanded query and is then used to rank the retrieved documents by relevance. Term frequency-inverse document frequency (TF-IDF) is used in this paper to weight the individual tokens in both the expanded query and the data source. This weight represents the importance of a term based on the number of times it appears in a document and on the number of documents that contain it.
The weight of a term (t) in a document (d) is computed according to equation (1):

$$W(t,d) = TF(t,d) \times \log\left(\frac{N}{DF(t)}\right) \qquad (1)$$

where TF(t,d) is the number of occurrences of term (t) in document (d), DF(t) is the number of documents containing the term (t), and N is the number of documents in the data source.
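
The following Python sketch shows how such weights could be computed from TF, DF, and N as defined above. It assumes the common form TF × log(N / DF) with a natural logarithm; the base of the logarithm and the function name are assumptions, and other TF-IDF variants exist.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using TF * log(N / DF).

    documents: list of token lists (already preprocessed).
    """
    n_docs = len(documents)

    # DF(t): number of documents that contain term t at least once.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)  # TF(t, d): occurrences of t in this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights
```

Under this form, a term that occurs in every document receives a weight of zero, since log(N / N) = 0, which reflects that it does not help to distinguish between documents.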

Similarity Measures
A similarity measure quantifies how alike two objects are. It can be used to calculate the similarity between two queries, two documents, or a document and a query. The two measures used in this work are cosine similarity and Jaccard similarity.

Cosine Similarity Measure
The cosine similarity between two data sets or two vectors is the cosine of the angle between them. This measure captures orientation rather than magnitude. It can be seen as a comparison between documents in a normalized space, because it takes into consideration the angle between the documents rather than only the magnitude of each word count (TF-IDF) of each document [17]. For two documents (d1) and (d2), this measure is given by equation (2):

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert} \qquad (2)$$
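
As a small illustration, the cosine measure can be sketched in Python for sparse vectors stored as dictionaries that map terms to weights. The dictionary representation, the function name, and the choice to return 0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)
```

Two vectors that point in the same direction score 1 regardless of their lengths, which is why a long document and a short query can still match closely.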

Jaccard Similarity Measure
This measure computes the similarity between two nominal attributes or between two sets by dividing the size of their intersection by the size of their union. The Jaccard similarity between two sets A and B, i.e. JS(A, B), is a convenient measure because it is bounded between 0 and 1; JS(A, B) = 0 if and only if A∩B = ∅, and JS(A, B) = 1 if and only if A = B. It has gained recent interest for finding documents (or web pages) that are very similar but not identical, as well as for plagiarism detection [18]. Mathematically, the Jaccard measure is given by equation (3):

$$JS(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (3)$$
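
The Jaccard measure on term sets can be sketched in a few lines of Python. The function name and the convention of returning 0 for two empty sets are assumptions made for this example.

```python
def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here when both sets are empty
    return len(a & b) / len(union)
```

For example, a query with the terms {earthquake, quake} and a document with the terms {earthquake, city, damage} share one term out of four distinct terms, giving a score of 0.25. Because the union is usually dominated by the document's vocabulary, Jaccard scores for short queries against long documents tend to be small.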

The Evaluation of the Proposed System
The two most common measures of information retrieval performance are precision and recall [19]. Evaluating the proposed system is necessary because it measures the system's performance. These measures are given in equations (4) and (5). Precision (P) is the fraction of retrieved documents that are relevant:

$$P = \frac{\#(\text{relevant documents retrieved})}{\#(\text{retrieved documents})} \qquad (4)$$

Recall (R) is the fraction of relevant documents that are retrieved:

$$R = \frac{\#(\text{relevant documents retrieved})}{\#(\text{relevant documents})} \qquad (5)$$

Besides these two measures, there is the F-measure, which is the weighted harmonic mean of precision and recall (here with equal weights), as shown in equation (6):

$$F\text{-Measure} = 2 \times \frac{P \times R}{P + R} \qquad (6)$$
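
The three evaluation measures can be illustrated with a short Python sketch. The function name and the convention of returning 0 when a denominator would be zero are assumptions made for this example.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_measure) for one query.

    retrieved: ids of the documents returned by the system
    relevant:  ids of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For instance, if four documents are retrieved and three of them are among five relevant documents, the precision is 3/4 = 0.75, the recall is 3/5 = 0.6, and the F-measure is 2 × 0.45 / 1.35 ≈ 0.667.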

The Proposed Methodology
The proposed system consists mainly of the following steps: preprocessing of the data source and the query, query expansion based on WordNet, term selection, term weighting, and ranking of the documents according to a score calculated with the cosine and Jaccard similarity measures in order to obtain the relevant documents. The following sections describe the basic stages of this system.
4.1 Query Preprocessing: this stage includes four steps (tokenization, normalization, stop words removal, and stemming).
• Text extraction: extracting the entire text from the documents and the query.
• Tokenization: dividing the whole text into words.
• Removing stop words: removing frequently used words such as articles, adjectives, prepositions, etc.
• Word stemming: reducing words to their stems.
4.2 Query Expansion: this step is concerned with finding the synonyms of the individual terms of a query. This is achieved using WordNet, a database that contains, for each word in the English language, its corresponding synonyms (synsets). Some words may have many synonyms, so a pruning operation is required to reduce them.

4.3 Synonym Selection: the aim of this step is to select, from the whole list of synonyms of a specific term, those that are present in the documents; if a synonym is present in any document of the collection, it is chosen; otherwise it is ignored.

4.4 Term Weighting: after all the previous steps, the term weighting step is responsible for calculating the weights of the remaining query terms using the TF-IDF weighting measure. Algorithm (1) shows the four previous steps.

4.5 Calculating Similarity: this is the last stage of the proposed system, where scoring is performed with two similarity measures, i.e. cosine similarity and Jaccard similarity, applied to each query-document pair. Algorithm (2) describes how the relevant documents of the query are retrieved.
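
The scoring stage can be sketched in Python as a generic routine that scores every document against the query, keeps those that reach a minimum threshold, and sorts them by score. This is an illustrative sketch only, not a reproduction of Algorithm (2): the function names, the document identifiers, the inclusive threshold comparison, and the simple set-overlap scorer used in the demonstration are all assumptions.

```python
def retrieve(query, documents, similarity, threshold):
    """Return (doc_id, score) pairs whose score reaches the threshold, best first.

    query:      query representation (here a set of terms)
    documents:  dict mapping doc_id to a document representation
    similarity: function taking (query, document) and returning a score
    threshold:  minimum score a document needs to be retrieved
    """
    scored = ((doc_id, similarity(query, doc)) for doc_id, doc in documents.items())
    hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


def term_overlap(a, b):
    """Set overlap (intersection over union), used here only as a demo scorer."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy collection for demonstration.
docs = {
    "d1": {"earthquake", "washington", "damage"},
    "d2": {"earthquake", "city"},
    "d3": {"election", "vote"},
}
results = retrieve({"earthquake", "washington"}, docs, term_overlap, threshold=0.3)
# d1 and d2 reach the threshold; d3 shares no term with the query and is dropped.
```

Passing the similarity function as a parameter allows the same routine to be run once with a cosine scorer and once with a Jaccard scorer, each with its own threshold.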

The Experimental Results
The proposed system used the DUC2002 summarization dataset, a freely available document collection that contains 559 documents on multiple subjects such as natural disasters, politics, the Middle East, sport, health, etc. This data source goes through the four preprocessing stages and is then saved into text files to be ready for the next step. The practical part of this work was implemented in the C# programming language, one of the languages available in Visual Studio 2015. It runs under Windows 10 on a machine with 8 GB of RAM and an Intel Core i5 CPU (1.8 GHz).
On the query side, the query is treated as a vector. The four preprocessing steps are applied to it, its synonyms are found using WordNet, the best synonyms are selected, the weight of each term is calculated using TF-IDF, and finally the similarity measures (cosine and Jaccard) are used to find the most relevant documents. As illustrated in Figure 3, the system reads the query to search for, along with the minimum threshold for the similarity measure value, and the search button is then clicked. The search button performs the following steps: Query Preprocessing, Query Expansion, Term Selection, Term Weighting, and Similarity Measure. Because of space limitations, this paper shows an example for one query and its result. Table 1 shows the TF-IDF weights for the first 20 documents when the system reads the query "earthquake in Washington".

The proposed system assigns an assumed weight to the query terms: the original terms are assigned 1 and the synonyms are assigned 0.5, as shown in Table 2. With the cosine similarity measure, document 180 had the top score for this query, with a score of 0.8381962, whereas document 105 had a score of 0.7941380, and document 323 was third with a score of 0.7941380. With the Jaccard similarity measure, document 323 had the top score for this query, with a score of 0.0230769, whereas document 420 scored 0.0194174 and document 316 scored 0.016. Table 3 shows the top ten scoring documents for the above query. The proposed system was evaluated using the precision, recall, and F1 measures, as given in equations (4), (5), and (6), respectively. The system was applied to a sample of random queries. Table 4 reports the precision, recall, and F1-measure of 5 random queries with minimum thresholds of 0.7 for cosine similarity and 0.003 for Jaccard similarity. It is worth noting that these results were obtained on the first 50 documents.

Conclusions and Future Work
Based on the results obtained from this work, a number of conclusions were drawn regarding the proposed system.

• Query expansion is capable of overcoming the problem of vocabulary mismatch in IR systems.
• The mismatch between query terms and document terms strongly affects the effectiveness of the retrieval operation.
• The precision and recall measures focus on assessing the retrieval of true positive documents. They provide the percentages of the relevant documents that are retrieved and of the false positive documents.
• The precision of cosine similarity is better than that of Jaccard similarity because all the documents retrieved with it are relevant, whereas with Jaccard similarity not all the retrieved documents are relevant.
• Giving a high priority (high weight) to the original terms of the query gives much better similarity and evaluation results.
• In the query expansion phase, the number of synonyms for specific words may be large, so a pruning operation is required to reduce them.
• The obtained results confirmed that the cosine similarity measure is better than the Jaccard similarity measure because its retrieved documents are more accurate. For random queries, the average precision was 100% with cosine and 84.4% with Jaccard, the average recall was 86.8% with cosine and 73.4% with Jaccard, and the average F-measure was 92% with cosine and 76% with Jaccard.
In future work, genetic algorithms or other optimization algorithms could be applied to information retrieval; these may perform better than the similarity measures used here, because the genetic modification may allow several additional related documents to be retrieved.

Figure 3-The Results for "Earthquake in Washington" Query

Table 2 - Assumption Weight for "earthquake in Washington" query

Table 3 - Cosine and Jaccard Similarity Measures for "earthquake in Washington" query