An Exponentiation Method for XML Element Retrieval

XML documents are now widely used for modelling and storing structured documents. Their structure is very rich and carries important information about contents and their relationships, for example, in e-Commerce. XML data-centric collections require query terms that allow users to specify constraints on the document structure; mapping structural queries and assigning weights are therefore significant for determining the set of possibly relevant documents with respect to structural conditions. In this paper, we present an extension to the MEXIR search system, which we call the Exponentiation function, that supports the combination of structural and content queries in the form of content-and-structure queries. The results show that structural information improves the effectiveness of the search system by up to 52.60% over the baseline BM25 in terms of MAP.


Introduction
Nowadays, XML (http://www.w3.org/TR/xml11/) is used for an increasing number of documents that are organized according to a certain structure [1]. Exploiting this structure is a significant part of improving retrieval effectiveness, and the approaches can be divided into two categories: those based on document structure and those based on user queries. Several retrieval models based on document structure have been developed, such as the BM25F [2] ranking function, which treats a document as composed of several fields with potentially different degrees of importance; PRM-S [3], which is based on a probabilistic retrieval model; and FRM [4], a relevance feedback function based on the language model. Broschart and Schenkel presented proximity weighting to improve the search system [5]. On the other hand, some approaches are based on user queries, such as QRX [6], which is based on a tree matching model that does not require knowing the exact structure of the data and uses the similarity measure of the vector space model. Unfortunately, this method has a drawback in terms of efficiency. Weights have been based on the depth of the path and the location in the document's logical structure and then used as probabilities in a function based on the language model [7]; element length has been used as a normalization incorporated through a prior probability in the ranking function [8]. The structure weight in the TopX (http://topx.sourceforge.net/) search engine is highlighted in [9,10]: it assigns a small, constant, and tunable score to every navigational condition that is matched by the query, using the frequency of the tag name. The weight has also been calculated from the distribution of tag names, in a way similar to the binary independence retrieval model, by investigating the presence of tags in relevant and nonrelevant elements to estimate the tag weights [11].
In [12], it is shown that structure does not improve the effectiveness of the retrieval system much, because users are very bad at giving structural hints with respect to the INEX-IEEE collection, and that further investigation is required. In this paper, we investigate retrieval techniques and related issues over a strongly structured collection of XML documents, the Initiative for the Evaluation of XML Retrieval (INEX) (https://inex.mmci.uni-saarland.de/) collections, based on user queries. With richly structured XML data, we show that structural information, exploited through the Exponentiation function, can be utilized to improve the effectiveness of search systems. This paper is organized as follows. Section 2 reviews the data model and notions. Section 3 explains the proposed approach. Section 4 shows the experimental results and discussion; conclusions and further work are drawn in Section 5.

Data Model and Notions
In this section, we provide some historical perspectives on the areas of XML research that have influenced this article.

XML Indexing Methods.
The basic XML data model is a labeled, ordered tree. Figure 1 shows the data tree of an XML document based on the node-labeled model.
Classical retrieval models have been adapted to XML retrieval. Several indexing strategies have been developed in XML retrieval as shown in Figure 2.
The Scientific World Journal
Element-Based indexing [8] allows each element to be indexed on the basis of both its direct text and the text of its descendants. This strategy has a major drawback in that it is highly redundant: text occurring at the nth level of the XML logical structure is indexed n times and thus requires more index space. This strategy is illustrated in Figure 2(a), where all elements are indexed. Leaf-Only indexing [13] allows indexing of only the leaf elements, that is, the element or elements directly related to text. This strategy addresses the redundancy issue noted above. However, the propagation algorithm for the retrieval of nonleaf elements requires a certain level of efficiency. This strategy is illustrated in Figure 2(b), where the leaf elements are indexed. Aggregation-Based indexing [14] uses the concatenated text of an element to estimate term statistics. This strategy has been used to aggregate term statistics directly on the basis of the text of an element and its descendants. This is illustrated in Figure 2(b), where the leaf elements are indexed. Selective indexing [13,15] involves eliminating small elements and elements of a selected type; this strategy is illustrated in Figure 2(c), where only semantic elements are indexed. Distributed indexing [15] creates a separate index for each type of element in conjunction with the selective indexing strategy, as shown in Figure 2(c). The ranking model runs on each index separately and retrieves ranked lists of elements. These lists are merged to provide a single ranking across all element types. To merge the lists, normalization is performed to take into account the variation in element size across the different indices, such that scores across indices are comparable.
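The difference between the first two strategies can be illustrated with a minimal sketch. The sample document, the path notation, and the helper names (`iter_paths`, `full_text`, `direct_text`, `build_units`) are assumptions made for this illustration only; they are not taken from any of the cited systems.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
DOC = (
    "<article><title>XML retrieval</title>"
    "<section><p>element retrieval</p><p>ranking</p></section></article>"
)

def iter_paths(node, path=None):
    """Yield (absolute path, element) pairs in document order."""
    path = path or f"/{node.tag}[1]"
    yield path, node
    seen = {}
    for child in node:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        yield from iter_paths(child, f"{path}/{child.tag}[{seen[child.tag]}]")

def full_text(node):
    """Text of the element and all of its descendants (element-based unit)."""
    return " ".join(" ".join(node.itertext()).split())

def direct_text(node):
    """Only the text directly attached to the element (leaf-only unit)."""
    parts = [node.text or ""] + [child.tail or "" for child in node]
    return " ".join(" ".join(parts).split())

def build_units(root, text_of):
    """Map each element path to its indexable text, skipping empty units."""
    units = {}
    for path, node in iter_paths(root):
        text = text_of(node)
        if text:
            units[path] = text
    return units

root = ET.fromstring(DOC)
element_based = build_units(root, full_text)   # every element is indexed
leaf_only = build_units(root, direct_text)     # only text-bearing elements
```

In this sketch the element-based index holds five units and repeats the leaf text at every ancestor level, whereas the leaf-only index holds three units and stores each word once, which reflects the redundancy trade-off described above.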

XML Query Languages.
Querying structured documents must take both content and structure into account. INEX identified two types of queries [23,24]: content-only (CO) and content-and-structure (CAS) queries, as follows.

Content-Only Queries.
These queries ignore the document structure, in the same way as the traditional queries used in IR collections. However, they pose a challenge to XML retrieval in that retrieval returns document components, that is, XML elements, instead of whole documents in response to a user query. The answers can be elements of varying complexity, that is, at different levels of the XML document's structure. This type of query is suitable for XML retrieval where users do not know or are not concerned about the structure, that is, the logical organization of the document, when expressing their information needs. For example, the best answer for a query "XML retrieval" applied to Figure 1 may be a "section" and not the "title" or "p" elements.

Content-and-Structure Queries.
These queries contain conditions on both content and structure. The conditions may refer to the content of specific elements and specify the type of the requested answer elements. However, the complexity and expressiveness of content-and-structure query languages make them difficult for end users, who have to know the logical organization of the documents when expressing their information needs. Trotman and Lalmas [12] showed that structure did not improve the effectiveness of the retrieval system very much because users were normally not capable of giving useful structural hints with respect to the INEX-IEEE collection. However, content-and-structure queries can be very useful for expert users in specialized scenarios.

The Narrowed Extended XPath I.
The Narrowed Extended XPath I (NEXI) query language was developed at INEX [25] as a simple query language for content-oriented XML retrieval evaluation. The enhancement comes from the introduction of a new function named "about()". The "contains()" function of XPath, which requires an element (its text) to contain the given string content, was replaced by the "about()" function, which requires an element to be about the content. NEXI queries provide support for the descendant axis (//).
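As a rough illustration of how such a query decomposes into a target path, a support path, and content terms, the following sketch parses a single "about()" clause. It handles only the simple form used by the example query that appears later in this paper; the regular expression, the function name `parse_nexi`, and the returned dictionary layout are assumptions of this sketch and do not represent a complete NEXI parser.

```python
import re

# Assumed simplified grammar: //a//b[about(<relative path>, <terms>)]
NEXI_PATTERN = re.compile(
    r'^\s*(?P<target>(?://\w+)+)'                  # target path, descendant axis only
    r'\[\s*about\(\s*(?P<rel>\.?(?://\w+)*)\s*,'   # relative path inside about()
    r'\s*"?(?P<terms>[^")]+)"?\s*\)\s*\]\s*$'      # content terms, quotes optional
)

def parse_nexi(query):
    """Split a single-clause NEXI query into paths and lower-cased terms."""
    match = NEXI_PATTERN.match(query)
    if match is None:
        raise ValueError(f"unsupported query form: {query!r}")
    target = [step for step in match.group("target").split("//") if step]
    relative = [step for step in match.group("rel").lstrip(".").split("//") if step]
    return {
        "target": target,              # element names of the requested answer
        "support": target + relative,  # path in which the terms must occur
        "terms": match.group("terms").lower().split(),
    }

parsed = parse_nexi('//section[about(//p, "xml retrieval")]')
```

Here `parsed["target"]` names the element type to return, while `parsed["support"]` gives the names that the content condition navigates through.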

Structure Weight IR.
Schlieder and Meuss presented QRX [6], which is based on tree matching without knowing the exact structure of the data and computes an element score with the similarity measure of the vector space model.
Stephen et al. [2] and Robertson and Zaragoza [26] present BM25F as an extension of the baseline BM25 [27] scoring function that is adapted to score fielded documents. Using the BM25F scheme presented in [28], an element score is computed as follows:

$$\mathrm{Score}(e, q) = \sum_{t \in q} \frac{\bar{x}_{e,t}}{K + \bar{x}_{e,t}} \, W_t,$$

where $\mathrm{Score}(e, q)$ measures the relevance of element $e$ to query $q$, $\bar{x}_{e,t}$ is a weighted normalized term frequency, $K$ is a common tuning parameter for the BM25, and $W_t$ is the inverse document frequency weight of term $t$.
The weighted normalized term frequency is obtained by first performing length normalization on the term frequency $x_{e,f,t}$ of term $t$ in field $f$ in element $e$ as follows:

$$\bar{x}_{e,f,t} = \frac{x_{e,f,t}}{1 + B_f \left( \frac{\mathrm{len}_{e,f}}{\mathrm{avglen}_f} - 1 \right)},$$

where $B_f$ is a smoothing parameter, $\mathrm{len}_{e,f}$ is the length of field $f$, and $\mathrm{avglen}_f$ is the average length of elements in the entire collection, and then multiplying the normalized term frequency $\bar{x}_{e,f,t}$ by the field weight $W_f$:

$$\bar{x}_{e,t} = \sum_{f} W_f \cdot \bar{x}_{e,f,t}.$$

Kim and Croft [4] recently introduced the Field Relevance Model (FRM). FRM employs the notion of field relevance, a correspondence between query terms and document fields. Given a query $Q = q_1, \ldots, q_m$, the field relevance $P(F_j \mid q_i, R)$ is the distribution of per-term relevance over document fields. Based on the field relevance estimates $P(F_j \mid q_i, R)$, the Field Relevance Model combines the field-level scores $P(q_i \mid F_j, D)$ for each document using field relevance instead of weights as follows:

$$\mathrm{Score}(D, Q) = \prod_{i=1}^{m} \sum_{j=1}^{n} P(F_j \mid q_i, R) \, P(q_i \mid F_j, D),$$

where $n$ is the number of fields.
Broschart and Schenkel [5] presented proximity-aware scoring functions that lead to significant effectiveness improvements for XML retrieval. This method introduces modified proximity scores that take the document structure into account. To compute the proximity part of the score for each term $t$, an accumulated score $\mathrm{acc}(t)$ is first computed that depends on the distance of this term's occurrences in the element to other, adjacent query term occurrences: for each adjacent occurrence of a term $t'$ at distance $d$ to an occurrence of $t$, $\mathrm{acc}(t)$ grows by $\mathrm{idf}(t')/d^{2}$. The proximity score $\mathrm{Score}(e, q)$, which measures the relevance of element $e$ to a query $q$, is then computed from these accumulated scores.
The approach of Ogilvie and Callan [7] is based on language models and employs element-based indexing. Given a query $q$ with terms $t$ and, for each element $e$, its corresponding element language model $\theta_e$, the element is ranked as follows:

$$P(e \mid q) \propto P(e) \, P(q \mid \theta_e),$$

where $P(e)$ is the probability of relevance for element $e$ and $P(q \mid \theta_e)$ is the probability of the query being generated by language model $\theta_e$. For instance,

$$P(q \mid \theta_e) = \prod_{t \in q} \left( \lambda \, P(t \mid e) + (1 - \lambda) \, P(t \mid C) \right),$$

where $P(t \mid e)$ is the estimate of term $t$ in element $e$, $P(t \mid C)$ is the probability of term $t$ in collection $C$, and $\lambda$ is the smoothing parameter.
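The smoothed language-model estimate described above can be sketched as follows. This is a minimal illustration assuming linear interpolation between the element model and the collection model, with the smoothing parameter (called `lam` here) applied to the element model and a uniform element prior; the function name and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, element_terms, collection_terms, lam=0.5):
    """Log-probability of the query under an interpolated element language model.

    Assumption: P(t) = lam * P(t | element) + (1 - lam) * P(t | collection).
    """
    element_tf = Counter(element_terms)
    collection_tf = Counter(collection_terms)
    log_prob = 0.0
    for term in query_terms:
        p_element = element_tf[term] / len(element_terms) if element_terms else 0.0
        p_collection = collection_tf[term] / len(collection_terms)
        p = lam * p_element + (1.0 - lam) * p_collection
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        log_prob += math.log(p)
    return log_prob

# Toy data, for illustration only.
collection = ["xml", "retrieval", "xml", "ranking"]
score_a = query_log_likelihood(["xml"], ["xml", "retrieval"], collection)
score_b = query_log_likelihood(["xml"], ["ranking"], collection)
```

Working in log space avoids numerical underflow for longer queries; the ranking is unchanged because the logarithm is monotonic.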
To account for the length of an element $e$, and in particular for the heavily biased distribution of small elements in XML documents, the element length can be used to set $P(e)$ as follows [8]:

$$P(e) = \frac{\mathrm{length}(e)}{\sum_{e' \in C} \mathrm{length}(e')},$$

where $\mathrm{length}(e)$ is the length of element $e$ and the sum runs over the lengths of the elements occurring in collection $C$.
Theobald et al. [10] present the extended BM25 function in TopX, which is known as the Compactness of the baseline BM25. It normalizes the length of an element by the average length of the elements with the same tag in the entire collection, with $k_1$ and $b$ as the common tuning parameters for the BM25. The modified function provides a dampened influence of the term statistics for a given tag. However, this strategy is limited in that each tag name must be the same to implement automatic grouping and weight calculation.
The idea is to associate a weight with a structural constraint to reflect its significance. These weights are then used in the scoring function that estimates an element's relevance. With the increased availability of data-centric XML collections, the need to query both the structure and the content of XML documents has become explicit. As a result, a more complex information source is available, which allows us to improve the performance of search systems. Our approach considers the use of a structure weight method, as discussed in Section 3.
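To make the field-weighted scoring reviewed in this section more concrete, the following sketch follows the commonly published BM25F form: per-field length normalization, a weighted sum over fields, and a saturating combination with the inverse document frequency. The function signature, the dictionary layout, and the toy values are assumptions of this sketch rather than the implementation of any cited system.

```python
def bm25f_score(query_terms, field_tf, field_len, avg_field_len,
                field_weight, field_b, idf, K=1.2):
    """BM25F-style score of one element for a bag of query terms.

    field_tf maps a field name to a {term: frequency} dictionary; the other
    dictionaries map field names (or terms, for idf) to numbers.
    """
    score = 0.0
    for term in query_terms:
        weighted_tf = 0.0
        for field, tf in field_tf.items():
            # Length normalization of the raw frequency within this field.
            norm = 1.0 + field_b[field] * (field_len[field] / avg_field_len[field] - 1.0)
            weighted_tf += field_weight[field] * tf.get(term, 0) / norm
        # Saturation: repeated occurrences add progressively less.
        score += weighted_tf / (K + weighted_tf) * idf.get(term, 0.0)
    return score

# Toy element with two fields of average length, for illustration only.
tf = {"title": {"xml": 1}, "body": {"xml": 1}}
lengths = {"title": 5, "body": 50}
flat = bm25f_score(["xml"], tf, lengths, lengths,
                   {"title": 1.0, "body": 1.0}, {"title": 0.5, "body": 0.5}, {"xml": 1.0})
boosted = bm25f_score(["xml"], tf, lengths, lengths,
                      {"title": 3.0, "body": 1.0}, {"title": 0.5, "body": 0.5}, {"xml": 1.0})
```

Raising the weight of the title field increases the score of an element whose title contains the query term, which is the kind of structural preference that a structure weight is meant to express.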

Method
Our method proceeds in steps: the search results become more refined at every step, and the refinement ultimately narrows down a set of potentially interesting documents. Below we describe our approach in more detail.

3.1.
Step 1: Elements Score. First, we define $\mathrm{Score}(e, q = t)$ as a score for the relevance of an element $e$ to a term $t$, and we use the baseline BM25 [27] formula in Sphinx (http://sphinxsearch.com/) [29] to score the element nodes according to the query terms contained in content conditions as follows:

$$\mathrm{Score}(e, q = t) = W_t \cdot \frac{(k_1 + 1) \, tf_{e,t}}{k_1 \left( (1 - b) + b \, \frac{\mathrm{len}(e)}{\mathrm{avel}} \right) + tf_{e,t}},$$

where $\mathrm{Score}(e, q = t)$ measures the relevance of element $e$ to query term $t$, $tf_{e,t}$ is the frequency of term $t$ occurring in element $e$, $\mathrm{len}(e)$ is the length of element $e$, avel is the average length of elements in the entire collection, and $k_1$ and $b$ are used to balance the weight of term frequency and element length.
We then compute the inverse element frequency as follows:

$$W_t = \log \frac{N - ef_t + 0.5}{ef_t + 0.5},$$

where $W_t$ is the inverse element frequency weight of term $t$, $N$ is the total number of elements in the entire collection, and $ef_t$ is the total number of elements in which term $t$ occurs.
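The per-term element score and the inverse element frequency of this step can be sketched as below. The code assumes the common BM25 form for both quantities; the function names are hypothetical, and the default values of `k1` and `b` are the settings reported in the experiments of this paper.

```python
import math

def inverse_element_frequency(n_elements, n_containing):
    """Assumed BM25-style weight: rarer terms receive a larger weight."""
    return math.log((n_elements - n_containing + 0.5) / (n_containing + 0.5))

def bm25_term_score(tf, length, avg_length, w_t, k1=1.2, b=0.0):
    """Score of one element for one query term.

    tf is the term frequency in the element, length the element length,
    avg_length the average element length, and w_t the term weight.
    """
    norm = k1 * ((1.0 - b) + b * length / avg_length)
    return w_t * (k1 + 1.0) * tf / (norm + tf)

w_rare = inverse_element_frequency(100, 10)
w_common = inverse_element_frequency(100, 40)
short = bm25_term_score(tf=2, length=10, avg_length=100, w_t=1.0)
long_ = bm25_term_score(tf=2, length=1000, avg_length=100, w_t=1.0)
```

With `b` set to 0, element length has no influence on the score, so `short` and `long_` are equal in this sketch; a positive `b` would penalize elements that are longer than average.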
For an "about()" function in the NEXI operator with multiple terms that appear in an element $e$, the aggregated score of $e$ is simply computed as the sum of the element's scores for each of the term conditions $t_1, \ldots, t_m$ as follows:

$$\mathrm{Score}(e, q) = \sum_{i=1}^{m} \mathrm{Score}(e, q = t_i).$$

3.2.
Step 2: Score Sharing Function. In the second step of our approach [30], we take the scores computed by (14) for all elements in the collection that contain query terms. We then adjust the scores of elements by accounting for their relevant descendants: the scores of the retrieved elements are shared between the leaf nodes and their parents in the document's XML tree according to the following scheme:

$$\mathrm{Score}(pe, q) \leftarrow \mathrm{Score}(pe, q) + \beta^{n} \cdot \mathrm{Score}(ce, q),$$

where $\mathrm{Score}(pe, q)$ is the score of the current parent node $pe$, $\mathrm{Score}(ce, q)$ is the score of a relevant child $ce$ of that element, $\beta$ is a tuning parameter, and $n$ is the distance between the current parent node and the leaf node. If $\beta$ lies between 0 and 1, preference is given to the leaf node over the parents; otherwise, preference is given to the parents.
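The sharing scheme can be sketched as follows, under the assumption that each leaf passes its score to every ancestor, multiplied by the tuning parameter raised to the power of the distance, and that contributions from several leaves are added. The path notation, the function name `share_scores`, and the toy scores are hypothetical and are not taken from the MEXIR implementation.

```python
def share_scores(leaf_scores, beta):
    """Propagate leaf scores to all ancestors along absolute paths.

    leaf_scores maps an absolute path such as "/article[1]/section[1]/p[1]"
    to a content score. Assumption: an ancestor at distance n from a leaf
    receives score * beta ** n, and contributions are summed.
    """
    shared = dict(leaf_scores)
    for path, score in leaf_scores.items():
        steps = path.strip("/").split("/")
        for distance in range(1, len(steps)):
            ancestor = "/" + "/".join(steps[:-distance])
            shared[ancestor] = shared.get(ancestor, 0.0) + score * beta ** distance
    return shared

# Toy leaf scores, for illustration only.
leaves = {
    "/article[1]/section[1]/p[1]": 10.0,
    "/article[1]/section[1]/p[2]": 10.0,
}
shared = share_scores(leaves, beta=0.7)
```

With a parameter value below 1, each ancestor receives less than the leaf itself, so the leaves stay ahead of their parents unless several relevant leaves accumulate under the same parent.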

3.3.
Step 3: Exponentiation Weight Function. The third step of our approach is the structure score evaluation. To improve the search results on richly structured data, we assume that a query is composed of content (keywords) and structure constraints. The document-query similarity is evaluated by considering content and structure separately. We then combine these scores for the set of possibly relevant elements. Our structural scoring model essentially counts the number of navigational (i.e., element name-only) query conditions that are satisfied by a result candidate, while also considering the content conditions matched by the user query. It assigns a weight for every navigational condition that matches an element name, name ∈ path (i.e., an absolute path in the document structure). We analysed the structure of each INEX topic, as shown in Table 1, with respect to the INEX content-and-structure queries; each topic includes only a few structure indications. Thus, we propose a novel structural score, applied when the user query matches the structural constraints against the document tree, that uses Exponentiation of the form $b^{n}$. To evaluate the sensitivity of the Exponentiation, we varied the base, including base 10, base e, base 2, and base 1/2, as shown in Figure 3. According to the trend of the graph, base 2 is smoother than the other values, and the powers of 2 are important in computer science because there are $2^{n}$ possible values for an n-bit binary variable. Thus, for simplicity, our algorithm uses $2^{n}$. After that, we recompute the element score $\mathrm{Score}(e, q)$ with this weight, where $n$ is the frequency of the navigational conditions that are matched with name ∈ path.
The bold font refers to the percentages that are used to calculate the value of improvement.
In the following, we define $E(d)$ as the set of all elements in document $d$ that match the target element of the query. In document mode, every document inherits the aggregated score among all its target elements $e \in E(d)$, and these document scores $\mathrm{Score}(d, q)$ determine the output ranking among documents.
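A minimal sketch of the structural weighting and the document-level aggregation is given below. It rests on two assumptions that should be read as illustrative choices: the exponential weight multiplies the content score of an element, and the document score is the sum of the scores of its target elements. The function names and the path notation are hypothetical.

```python
def count_matched_conditions(query_names, element_path):
    """Count query element names that occur in the element's absolute path."""
    path_names = [step.split("[")[0] for step in element_path.strip("/").split("/")]
    return sum(1 for name in query_names if name in path_names)

def exponentiation_score(content_score, n_matched, base=2.0):
    """Assumed combination: the content score is boosted by base ** n_matched."""
    return content_score * base ** n_matched

def document_score(element_scores, aggregate=sum):
    """Assumed aggregation of target-element scores into one document score."""
    return aggregate(element_scores)

n = count_matched_conditions(["section", "p"], "/article[1]/section[1]/p[2]")
boosted = exponentiation_score(10.0, n)    # two matched names
unboosted = exponentiation_score(10.0, 0)  # no structural match
total = document_score([boosted, unboosted])
```

Changing `base` reproduces the kind of sensitivity comparison discussed in the text: a base above 1 rewards every additional matched condition, whereas a base of 1/2 reduces the score as more conditions match.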

Match any
The final weight is a sum of weighted phrase ranks for matching any of the query words.

Match phrase
The final weight is the sum of weighted phrase ranks for matching the query phrase, which requires a perfect match.

Match extended
The final weight is the sum of weighted phrase ranks and the BM25 weight, multiplied by a thousand and rounded to the nearest integer.

To see how users use structure in their queries, consider the information need "retrieve document sections in which a paragraph contains xml retrieval", which is expressed as follows:

//section[about(//p, "xml retrieval")]

The first filter looks for occurrences of the terms "xml" and "retrieval" in elements whose context matches the path "//section//p". It is possible to assign more weight to the return element. In this case, we assume that Score(e, q) for each element is 10 and that β is 0.7; the calculations are shown in Figure 4.
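The path condition of this example query can be checked against absolute element paths with a short sketch. It assumes that a descendant-axis path matches when its names occur in order along the element path and the last name is the element itself; the function name and the candidate paths are hypothetical, and the final line only illustrates the assumed values above (a score of 10 and a parameter of 0.7) rather than reproducing Figure 4.

```python
def matches_descendant_path(query_steps, element_path):
    """True if query_steps occur in order in element_path and end at the element."""
    names = [step.split("[")[0] for step in element_path.strip("/").split("/")]
    if not query_steps or names[-1] != query_steps[-1]:
        return False
    remaining = iter(names)
    # Each membership test consumes the iterator, which enforces the order.
    return all(step in remaining for step in query_steps)

# Hypothetical candidate elements of one document.
candidates = [
    "/article[1]/title[1]",
    "/article[1]/section[1]/p[1]",
    "/article[1]/section[1]/subsection[1]/p[1]",
    "/article[1]/abstract[1]/p[1]",
]
matched = [path for path in candidates if matches_descendant_path(["section", "p"], path)]

# Share of one paragraph's score (10) that reaches its parent section at
# distance 1, assuming multiplication by the parameter (0.7) per level.
section_share = 10 * 0.7 ** 1
```

Only the two paragraphs nested under a section satisfy "//section//p"; the paragraph under the abstract does not, although it has the right element name.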

Experiment Setup

INEX Evaluations.
The effectiveness of the retrieval results is evaluated using the same metrics as in traditional IR, for example, precision, recall, MAP, P@10, P@20, and P@30 [31,32]. Given a topic and a set of documents, each tested IR system returns an ordered subset of the documents, ranked by the system's estimate of the likelihood that each document is relevant to the topic. Several effectiveness measures are computed from this ranking.
Table 7: Best performing runs based on MAP over the information topics.
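The ranking-based measures named above can be computed as in the following sketch, which uses the standard definitions of precision at rank k and of (mean) average precision. The function names and the toy rankings are assumptions for illustration; the official INEX evaluation scripts may differ in details such as tie handling.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant (P@k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranked list, relevant set) pairs, one per topic."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Toy topics, for illustration only.
topic_1 = (["d1", "d2", "d3", "d4"], {"d1", "d3"})
topic_2 = (["d2"], {"d1"})
map_score = mean_average_precision([topic_1, topic_2])
```

Relevant documents that are never retrieved still count in the denominator of the average precision, so a run is penalized for missing them.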

Results and Discussion.
In this section, we tuned the parameters using the INEX-2005 ad hoc track evaluation scripts distributed by the INEX organizers. Our tuning approach was to maximize the sum of all relevance scores; the total number of leaf nodes is then 2500 and the parameter β is set to 0.60. Following that, we used the Sphinx parameters for the BM25, where k1 = 1.20 and b = 0.00. The Sphinx match modes used in our experiment, MATCH ANY (TF), MATCH PHRASE (PHRASE), and MATCH EXTENDED (BM25), are provided in Table 2. The main components of the MEXIR [33] retrieval system are as follows.
(1) When new documents are entered into the system, the Absolute Document XPath Indexing (ADXPI) [34] indexer parses and analyzes the name of each element and its position to build inverted lists for each index in this system.
(2) The SphinxDB search engine is used to build both indices in the system. The Selected Weight index is based on term frequency, and the Leaf Node index is based on the classic BM25 function.
(3) The Score Sharing function is used to assign parent scores by assigning a proportion of the scores of the leaf nodes to their parents using a top-down approach.
(4) The Exponentiation function is used to adjust the element scores based on a linear combination.
The MEXIR search engine retrieves XML elements based on the leaf-node index with respect to the significant words, applies the Exponentiation and Score Sharing functions, and then combines the relevance scores of the elements into the document score. Thus, the documents with the higher relevance scores are chosen as the retrieval set. The details of the experiment are shown in Table 3.
The performance of different features and ranking methods can now be evaluated. To deepen the analysis of the Exponentiation scoring function, we also ran experiments to study the impact of the structure weight with the content-and-structure query on performance. Table 4 compares the best performing runs with and without the Exponentiation technique. The run p16-BM25-EXPO used the Exponentiation for boosting the element score, and p16-BM25 is the baseline BM25; the Exponentiation function improved the effectiveness of the search system in terms of MAP, P@10, P@20, and P@30 by 52.60%, 50.60%, 54.16%, and 58.79%, respectively. Table 5 compares the best performing runs with and without the Score Sharing technique. The run p16-BM25-EXPO used the Exponentiation, and p16-SS-SW used the Score Sharing; the Exponentiation weight improved effectiveness over the Score Sharing technique in terms of MAP, P@10, P@20, and P@30 by 81.58%, 82.92%, 75.09%, and 67.83%, respectively. It can be seen that p16-BM25-EXPO obtained the best performance, and the improvement over both the baseline BM25 and the Score Sharing is significant for most of the considered metrics. The significance was computed with a 2-tailed t-test, as shown in Table 6. The p16-BM25-EXPO improved by 0.48% over the baseline BM25 at MAP, and 0.75% over the baseline BM25 with the Score Sharing at MAP on the INEX-IMDB collection.
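Comparisons of this kind rest on two small computations, sketched below: the relative improvement of one run over another, and the paired t statistic over per-topic scores. The function names and all numbers in the example are hypothetical and are not the values of the reported runs; obtaining a two-tailed p-value additionally requires the t distribution with one degree of freedom fewer than the number of topics, which is omitted here.

```python
import math
from statistics import mean, stdev

def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline`, as a percentage of the baseline."""
    return (new - baseline) / baseline * 100.0

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-topic score differences (paired samples)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-topic average precision of two runs.
run_new = [0.5, 0.6, 0.7]
run_base = [0.4, 0.4, 0.4]
gain = relative_improvement(mean(run_new), mean(run_base))
t_value = paired_t_statistic(run_new, run_base)
```

Pairing the scores by topic removes the variation in topic difficulty from the comparison, which is why the paired form is commonly preferred when both runs are evaluated on the same topics.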
In this analysis, we take the results that were obtained from BM25 with the Exponentiation and compare them with the results from the baseline BM25 and from the Score Sharing function. It is shown again that the Exponentiation works well with the document-centric XML documents. We can conclude that a significant improvement of the results can be obtained with the Exponentiation function from the content-and-structure query and the document structure. This finding suggests that it is possible to improve on the TF, PHRASE, and baseline BM25 approaches, which are the usual benchmarks in INEX. The main conclusion that can be drawn from the experiments is that the Exponentiation function is successful as a structure weight and could be utilized to improve the effectiveness of search systems.
Another major conclusion comes from analyzing the effectiveness of the runs for each of the three topic types with respect to the INEX report [17]; the results are presented in Tables 7, 8. In this analysis, we take the results that were obtained from the INEX report [17]. It is shown again that our system works well with the List and Informational topics of the document-centric XML documents, as measured with the MAP metric. Unfortunately, on the known-item topics, where the relevant answer is a single document, the performance was not satisfactory, and so further investigation is required.

Conclusions
With the increased availability of data-centric XML collections, the need to query both the structure and the content of XML documents has become explicit. As a result, a more complex information source is available, which allows us to improve the performance of search systems. In this paper, we investigated retrieval techniques and related issues over a strongly structured collection, using the Exponentiation weight for the document's structure with content-and-structure queries, in the data-centric track of INEX 2011. Our expectation was that structure weighting would improve the effectiveness of search systems. In terms of processing time, our system required an average of one second per topic. In addition, our run for the ad hoc task showed that structural information could be utilized to improve the effectiveness of the search system over the baseline BM25 in terms of MAP, P@10, P@20, and P@30 by 52.60%, 50.60%, 54.16%, and 58.79%, and over the Score Sharing technique in terms of MAP, P@10, P@20, and P@30 by 81.58%, 82.92%, 75.09%, and 67.83%, respectively. The success of our ad hoc run indicates that indexing the complete XML structure of IMDB and using the structure weights are necessary for effective document retrieval in the search system.
In future work, we will look closer at the relative value of various types of metadata, tags, and subject headings. We will also look at the different weighting methods underlying the relevance judgements and topic categories, such as blind feedback and recommendation search.