MapReduce Implementation of an Improved XML Keyword Search Algorithm

Extensible Markup Language (XML) is commonly employed to represent and transmit information over the Internet, so efficient keyword search over massive XML data has become a new issue. In this paper, we first present four properties to improve the classical ILE algorithm. Then, a parallel XML keyword search algorithm, based on intelligent grouping to calculate SLCA, is proposed and realized under the MapReduce programming model. Finally, a series of experiments is conducted on 7 datasets of different sizes. The results indicate that the proposed algorithm has high execution efficiency and is applicable to keyword search over massive XML data.


INTRODUCTION
Extensible Markup Language (XML) is the standard for data exchange on the Web. Massive data on the Web are stored and transmitted by means of XML, so storing and searching XML data in an efficient and accurate way is currently a hot research topic. The wide application of XML makes the extraction of information from XML data worth studying. To obtain desired information from XML data, users may turn to two query mechanisms: one is a structured query language, whose grammar is relatively complex and not conducive to use by many users; the other is keyword search, through which the information can be obtained directly.
Aiming at XML keyword search, the main research direction is the smallest lowest common ancestor (SLCA) [1,2] and its series of variant query semantics, e.g., VLCA [3,4] and Exclusive LCA (ELCA) [5,6]. Much research has been conducted thereon: Xu et al. [1] proposed three classical algorithms based on SLCA semantics, namely Indexed Lookup Eager (ILE), Scan Eager (SE) and a stack-based algorithm (Stack), and verified that the ILE algorithm is superior to the others.
Sun et al. [2] proposed Multi-SLCA, corresponding to the ILE algorithm, to enhance efficiency by skipping many redundancies based on anchor nodes. Li et al. [3] proposed the stack-based algorithm VLCAStack according to VLCA semantics and meaningful Dewey codes. Liu et al. [4] analyzed the XML data structure and keyword matching model and designed an algorithm to generate the returned results. Xu et al. [5] proposed the new semantic ELCA based on the deficiency in SLCA return results and proposed an indexed stack algorithm based on a stack structure. Bao et al. [7] proposed XML keyword search with relevance-oriented ranking according to SLCA semantics, in combination with XReal and XSeek, and analyzed the quality of the returned search results. Chen et al. [8] proposed a join-based algorithm that returns the first k results in XML keyword search in combination with a top-k algorithm. Zhou et al. [9] proposed the computation processes of FastSLCA and FastELCA based on set-intersection operations and put forward the FwdSLCA and BwdSLCA algorithms to generate SLCA and ELCA. Zhao et al. [10] classified nodes of probabilistic XML files by means of the extreme learning machine method. Based on the previous research and on SLCA and ELCA semantics, Dimitriou et al. [11] proposed a stack-based algorithm that returns the k best results.
However, the XML keyword search algorithms above only focus on small XML data and are inadequate for large-scale XML data. Parallelization is the most effective and feasible solution to this problem [12]. At present, some scholars have studied parallel XML keyword query. Zhang et al. [13] gave a simple solution for processing XML queries using the MapReduce framework. Camacho-Rodríguez et al. [14] considered the problem of parallelizing the execution of XQuery and proposed a solution under the MapReduce framework. Li et al. [15] parallelized the IMS algorithm in the Hadoop programming framework. Zhang et al. [16] proposed a distributed keyword search algorithm based on SLCA. However, these works do not optimize the original serial algorithms but merely parallelize them as-is. According to the features of XML, this paper proposes an efficient keyword search algorithm for massive XML data based on the MapReduce framework.
The main contributions of this paper are as follows:
• We demonstrate four properties and propose an Intelligent Indexed Lookup Eager (IILE) algorithm to optimize and improve the classical ILE algorithm.
• We implement the parallel algorithm of IILE based on MapReduce programming model to solve the problem in low efficiency on massive XML data.
• The simulation results demonstrate that the proposed IILE algorithm can effectively and efficiently deal with keyword search on massive XML data.
The remainder of the paper is organized as follows. SLCA and the MapReduce framework are briefly reviewed in Section 2. The four properties and the proposed IILE algorithm are presented in detail in Section 3. Section 4 analyses and evaluates the results of a series of experiments. Finally, the conclusion and future work are presented in the last section.

SLCA semantics
SLCA is an important concept for XML keyword query. It is mainly used to define the significant results returned for a given keyword query, which is the core problem in keyword query research [1,2,6,7,17-20]. Concretely, SLCA is defined as the set of subtree root nodes meeting the two conditions below:
• The subtree includes all keyword sequences.
• The subtree includes no smaller subtree but all keyword sequences.
XML keyword query work mainly involves the generation of candidate node sets and the reduction of SLCA calculation. As the returned result of an XML keyword query, the SLCA reflects the inclusion relations between candidate node sets.
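The definition above can be illustrated with a minimal brute-force sketch (an illustration only, not the paper's ILE/IILE algorithm): Dewey codes are modeled as dot-separated strings, the LCA of a match is their longest common prefix, and the SLCA set keeps only those candidate LCAs that have no descendant among the candidates.

```python
from itertools import product

def lca(*codes):
    """Longest common prefix of Dewey codes, e.g. lca('0.1.1.0.1', '0.1.1.1.0') == '0.1.1'."""
    parts = [c.split('.') for c in codes]
    prefix = []
    for comps in zip(*parts):
        if len(set(comps)) != 1:
            break
        prefix.append(comps[0])
    return '.'.join(prefix)

def is_ancestor(a, b):
    """True if node a is a proper ancestor of node b under Dewey coding."""
    return b.startswith(a + '.')

def slca(*keyword_sets):
    """Brute-force SLCA: LCAs of every match (one node per keyword),
    minus those that still contain a smaller candidate subtree."""
    candidates = {lca(*combo) for combo in product(*keyword_sets)}
    return sorted(c for c in candidates
                  if not any(is_ancestor(c, other) for other in candidates))
```

This exponential enumeration is only meant to make the two conditions of the definition concrete; the algorithms discussed next avoid it entirely.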

SLCA solution
ILE and SE [1] are the classical algorithms for solving SLCA. The main idea of ILE is to store the data in a B+ tree structure so as to support the lm and rm operations, obtain all Dewey code sets corresponding to the given keywords, and keep the nodes in each set sorted. ILE is applicable when the keyword set contains a low-frequency keyword, whereas SE, a variant of ILE, is applicable when the keyword frequencies fluctuate little. Both algorithms solve the SLCA of k keywords by the same rule, needing k − 1 intermediate sets in the computation process; a pair of node sets is required as the input for each SLCA calculation before an intermediate result set is generated.
Although the ILE algorithm has superior performance, it still has the following deficiencies. Firstly, it stores XML data in a B+ tree structure, which must be modified to support the necessary Dewey code operations, and this is complex to realize; moreover, a B+ index structure is not well suited to Dewey code data. Secondly, the SLCA computation process is "blocked" in the ILE algorithm, that is, the Dewey sets S_i must be processed in turn and the processing of the (i − 1)-th keyword must be completed before processing the i-th keyword [4], which greatly reduces the algorithm speed. Therefore, the traditional ILE algorithm is not efficient for large-scale data.
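The "blocked" pipeline described above can be sketched as a sequential fold over the keyword lists: each intermediate result set must be fully materialized before the next keyword list can be consumed. This is an illustrative simplification (brute-force pairwise SLCA), not the actual B+ tree based ILE implementation:

```python
from functools import reduce
from itertools import product

def lca(a, b):
    """Longest common prefix of two Dewey codes."""
    pa, pb = a.split('.'), b.split('.')
    n = 0
    while n < min(len(pa), len(pb)) and pa[n] == pb[n]:
        n += 1
    return '.'.join(pa[:n])

def remove_ancestors(nodes):
    """Keep only the 'smallest' nodes: drop proper ancestors of other nodes."""
    return {v for v in nodes
            if not any(w != v and w.startswith(v + '.') for w in nodes)}

def slca_pair(A, B):
    """SLCA of two Dewey code sets."""
    return remove_ancestors({lca(a, b) for a, b in product(A, B)})

def slca_blocked(keyword_sets):
    # Each step consumes the previous intermediate set before the next
    # keyword list can be processed -- the "obstruction" described above.
    return reduce(slca_pair, keyword_sets)
```

The k − 1 intermediate sets of ILE/SE correspond to the k − 1 applications of `slca_pair` in the fold; it is exactly this chain that the parallel algorithm later breaks up.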

Hadoop and MapReduce Framework
As an open-source framework, Hadoop mainly consists of the Hadoop Distributed File System (HDFS) and MapReduce. HDFS realizes the efficient storage and management of data on a cloud composed of a computer cluster [21]. As a parallel programming model, MapReduce is intended for parallel computing: it highly abstracts the parallel computing process into a Map function and a Reduce function. Therefore, compiling an application program in the MapReduce framework is a process of customizing the mapper and reducer. This computing model requires that the input, output and intermediate data are represented as key-value pairs <key, value>. The Map function accepts input in <k1,v1> form and generates intermediate output in <k2,v2> form. Hadoop integrates the v2 values sharing the same intermediate key k2 and transfers them to the Reduce function. The Reduce function accepts input in <k2, list of values> form and processes the value set; each Reduce output is in <k3,v3> form. Finally, the result is returned to the application program.
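The <key, value> flow can be mimicked in a few lines of plain Python (a toy single-machine simulation of the framework, not Hadoop's actual Java API), using word count as the canonical mapper/reducer pair:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle -> reduce over an in-memory list of <k1, v1> records."""
    # Map phase: each record yields intermediate <k2, v2> pairs.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)          # shuffle: group values by k2
    # Reduce phase: each <k2, [v2, ...]> group yields final <k3, v3> pairs.
    out = []
    for k2, values in sorted(groups.items()):
        out.extend(reducer(k2, values))
    return out

def wc_map(_, line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    yield word, sum(counts)
```

For example, `run_mapreduce([(0, 'a b a'), (1, 'b')], wc_map, wc_reduce)` returns `[('a', 2), ('b', 2)]`; the IILE realization later plugs keyword-selection and SLCA logic into the same mapper/reducer slots.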

Optimization and Improvement of ILE Algorithm
Aiming at the above-mentioned issues in the ILE algorithm, an improved ILE algorithm, called Intelligent Indexed Lookup Eager (IILE), is proposed. The optimization and improvement of the ILE algorithm are described from the following four aspects. To begin with, the complexity of the classical ILE can be expressed as O(k·d·|S_1|·log|S|), where |S_1| denotes the minimum size of the keyword lists, |S| the maximum size, and k and d indicate the number of keywords and the maximum depth of the XML tree, respectively. It is obvious that the algorithm efficiency is hugely affected by the sizes of the S_i, and the computing cost can be reduced before SLCA computing by decreasing S_i (i = 1 to k).
In Property 1, removeAncestor is the function for removing ancestor nodes from a Dewey code set. According to SLCA semantics, solving the SLCA of a keyword set amounts to solving the longest common prefixes among the corresponding Dewey code sets. It is obvious that removing the ancestor nodes in S_i does not affect the correctness of the SLCA solution but accelerates the computation. Next, if S_1 is split into |S_1|/|P| subsets (subgroups) [1], where P is a buffer of fixed size and each subset is denoted by B, then the SLCA result can be obtained faster by solving slca(B, S_2, …, S_k).
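A possible sketch of the removeAncestor step of Property 1 (the function name comes from the paper; this single-pass implementation is our own assumption): after sorting the codes by their numeric components, every ancestor lands immediately before its first descendant, so one linear scan removes all ancestors.

```python
def remove_ancestor(dewey_set):
    """Drop every code that is a proper ancestor of another code in the set.

    Sorting by numeric components puts each ancestor immediately before
    its first descendant, so a single scan with one-slot lookback suffices.
    """
    ordered = sorted(dewey_set, key=lambda c: [int(x) for x in c.split('.')])
    kept = []
    for code in ordered:
        if kept and code.startswith(kept[-1] + '.'):
            kept.pop()                      # predecessor is an ancestor of code
        kept.append(code)
    return kept
```

The O(n log n) sort dominates, which is cheap compared with performing the same screening repeatedly on intermediate SLCA candidate sets.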
Then, it is proven that the SLCA solution satisfies the associative law. The parallelization of the algorithm is realized in the remainder of this paper, and this property provides the parallel point for it: S_1 can be split, and correct SLCA solving over the parts is the foundation and guarantee for the parallelization of the algorithm.
If v_12 ∈ slca(S_1, S_2), a match {v_1, v_2} with v_12 as the anchor point must exist (where v_1 ∈ S_1 and v_2 ∈ S_2). Supposing A = slca(S_1, S_2) and B = slca(S_3, …, S_k), the same argument applies to slca(A, B). In conclusion, slca(S_1, …, S_k) can be computed associatively from such partial results. Finally, in the ILE and SE algorithms, the SLCA is obtained by removing ancestor nodes from the candidate set SLCAs directly or indirectly, so the intermediate candidate set is in fact generated by a massive calculation process. See Example 2 below for an intuitive description of the problem.
Two redundant SLCAs will be generated even though the result set contains only four SLCAs. In the case of a large XML document and many keywords, removing ancestor nodes is a heavy task, and the redundant SLCAs thus generated are a huge waste of computing cost.
Suppose lca(v_i, v_{i+1}) > lca(v_{i+1}, v_{i+2}), that is, v_i has a longer common prefix than v_{i+2} with respect to v_{i+1}. Then, in the process of calculating SLCA, the number of ancestors contained in the SLCA candidate set under the first decomposition condition cannot be smaller than that under the second decomposition condition. Since v_i and v_{i+1} are "closer", we suppose that an ancestor relationship exists between v_i and v_{i+1} and that the SLCA is generated by a certain node v in S_2. Under the first condition, we would have to remove the ancestor node generated just now after calculating the SLCA in order to obtain the accurate SLCA. However, the judgment for removing ancestor nodes can be made during the solution process, without waiting for the decomposition and solution to complete. Therefore, Example 2 can be solved again by the grouping strategy; see Example 3 below, where lca(0.1.1.0.1, 0.1.1.1.0) = 0.1.1 and lca(0.1.1.1.0, …) is computed analogously. The SLCA candidate set obtained from grouping needs no further screening, so the "ancestor relationships" generated in the solution process are not processed and no further computing resources are wasted. According to Lemma 1 and Lemma 2 in Ref. [1], the logical complexity of SLCA solving is simplified; the effect is more obvious when many keywords are involved.
Based on these four properties and in combination with Hadoop's parallel mechanism, ancestors are first removed while reading the metadata in the Map process. Then, S_i is intelligently grouped and each group is delivered to a Reducer for parallel processing. Finally, the results are collected and subjected to duplicate removal and ancestor removal.

Realization of Hadoop-based Algorithm
This section presents the realization of the IILE algorithm under the Hadoop MapReduce framework. The realization may be divided into four major parts, as shown in Fig. 1, which presents the flow chart of the SLCA keyword query for the keywords Title, Ben and John in the School.xml document tree [1] shown in Fig. 2.

Description of Algorithm
According to the above four properties, this subsection gives a detailed description of our proposed IILE algorithm based on the MapReduce framework. Algorithm 1 first selects the keywords to be looked up and the corresponding Dewey codes, and saves all Dewey codes of the same keyword to the same set.
In Algorithm 1, the m keywords to be looked up are {w_1, w_2, …, w_i, …, w_m}. We select each keyword to be looked up and its corresponding Dewey code through the SelectMapper function, and then merge the Dewey codes of a keyword into the set S_i in the SelectReducer function. The results are output to HDFS and used in the next sorting module. In Algorithm 2, the sets S_i are sorted in ascending order of the number of Dewey codes. By means of the automatic sorting of Reduce, the SortReducer function sorts the keywords according to the number of Dewey codes of each keyword. The size of the final SLCA set cannot exceed |S_min|, so processing the sets from small to large benefits the simplification of the computation process.
Then, Algorithm 3 realizes the parallelization of the IILE algorithm. In Line 2, ancestor codes are removed from the Dewey code sets according to Property 1. Apart from S_1, every S_i is included in the set Ss, and the Dewey codes are classified intelligently. S_1 is grouped intelligently by the function IGA(S_1) in Line 13 of Algorithm 3, which is implemented as follows.
The sub-function IGA performs the intelligent grouping. Lines 3-4 indicate that no grouping is required if the number of Dewey codes is too small. Lines 11-16 indicate that if lca(v_j, v_{j+1}) = lca(v_{j+1}, v_{j+2}), then v_j and v_{j+1} are in the same group. The next branch indicates that if lca(v_j, v_{j+1}) > lca(v_{j+1}, v_{j+2}), then v_j and v_{j+1} are in the same group, and v_{j+1} and v_{j+2} are not in the same group. Lines 24-30 indicate that if lca(v_j, v_{j+1}) < lca(v_{j+1}, v_{j+2}), then v_j and v_{j+1} are not in the same group.
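A sketch of this grouping rule, assuming the Dewey codes arrive sorted in document order (group membership is decided by comparing the depths of adjacent LCAs; the `min_size` threshold and the split bookkeeping are our assumptions, not the paper's exact pseudocode):

```python
def lca_depth(a, b):
    """Depth (number of components) of the LCA of two Dewey codes."""
    pa, pb = a.split('.'), b.split('.')
    n = 0
    while n < min(len(pa), len(pb)) and pa[n] == pb[n]:
        n += 1
    return n

def iga(codes, min_size=3):
    """Intelligent grouping by comparing lca(v_j, v_{j+1}) with lca(v_{j+1}, v_{j+2})."""
    if len(codes) < min_size:          # too few codes: no grouping required
        return [codes]
    breaks = set()                     # i in breaks => split between codes[i] and codes[i+1]
    for j in range(len(codes) - 2):
        d1 = lca_depth(codes[j], codes[j + 1])
        d2 = lca_depth(codes[j + 1], codes[j + 2])
        if d1 > d2:
            breaks.add(j + 1)          # v_{j+1} and v_{j+2} are not in the same group
        elif d1 < d2:
            breaks.add(j)              # v_j and v_{j+1} are not in the same group
    groups, current = [], [codes[0]]
    for i in range(1, len(codes)):
        if i - 1 in breaks:
            groups.append(current)
            current = []
        current.append(codes[i])
    groups.append(current)
    return groups
```

Each resulting group can then be shipped to its own Reducer, since nodes in different groups cannot contribute a deeper common SLCA than the nodes within a group.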

EXPERIMENTAL ANALYSIS
This paper designs a series of experiments combining several factors of XML keyword query, such as the number and frequency of keywords, the size of the XML dataset and the number of parallel cluster nodes. Through the analysis of the experimental results, this section also validates the high efficiency of the Hadoop-based XML keyword query.
In our experiments, 17 nodes are involved in the cluster. The configuration information and operating environments of all nodes are the same. We built the system platform using Ubuntu 10.12, Hadoop 0.20.2 and Java 1.6.0_21. One node is used as the master and the rest are used as slave nodes.

Queried keyword information
Based on the needs of the experiment, 7 keyword groups were selected randomly from each dataset in accordance with the number and frequency of keywords. As shown in Table 2, 49 keyword groups were available in total, numbered Q1-Q49. Each group is separated by commas, and the number and frequency of the keywords are separated by "-" [7]. For example, 4-100 means 4 keywords with a frequency of 100 (1±10%) each, i.e., all keywords that appear 90-110 times are counted as frequency 100.

Experiment 1: change the number of cluster nodes and compare the running time
To test the running time of the proposed IILE algorithm, we change the number of nodes in the cluster from 2 to 16 in the first experiment. Fig. 3 illustrates the corresponding results. It is obvious that the running efficiency on the small-scale datasets data1-data4 is not ideal, as shown in Fig. 3: the running time changes only slightly, and ever more slowly, as the number of nodes gradually increases. However, for the large-scale datasets data5-data7, the algorithm shows obvious advantages. Fig. 3 (e, f, g) indicates that the running time decreases obviously when the number of nodes changes from 2 to 4, while it changes slowly once the number of nodes reaches 8. The query condition has a small effect on the running time. For the same dataset, the running time under the query conditions 4-5000 and 16-100 is longer than under other conditions, which indicates that the number and the frequency of the keywords affect the running time.

Experiment 2: change the number of cluster nodes and compare the speed-up ratio
We introduce the speed-up ratio to analyze the experimental results and measure the performance and efficiency of the parallel system. The speed-up ratio is defined as: speed-up ratio = running time on a single machine / running time on the cluster. From the speed-up ratio curves in Fig. 4, we can clearly see that the speed-up ratio tends to increase as the number of nodes increases; however, the tendency is not obvious for small-scale datasets. We can also see that beyond 8 nodes, the increase of the speed-up ratio starts to slow down. This is because there is also information exchange between the nodes processing the same dataset, which incurs some system overhead. The algorithm speed therefore cannot increase unlimitedly with the number of nodes.

Experiment 3: change the size of cluster nodes and compare the optimal number of nodes
Fig. 5 shows the curve chart of the optimal number of nodes in the cluster. We can see from Fig. 5 that, for a dataset of a specific size, the running efficiency increases when the number of nodes increases, but more is not always better. As shown in Fig. 5, for the same data file, the decreasing tendency of the running time slows down as the number of nodes increases. This is determined by the block mechanism of Hadoop itself, whose default data block size is 64 MB: data of 128 MB is divided into 2 data blocks and data of 256 MB into 4 data blocks. The figure shows that data7 achieves the best efficiency in the cluster with 8 nodes, while the decrease of the running time flattens out if the number of nodes increases further.

CONCLUSION AND FUTURE WORK
In this paper, the parallelization of a MapReduce-based XML keyword search algorithm is further researched to process massive XML data. The grouping-based IILE algorithm is proposed and realized, and its parallelization is carried out by means of Hadoop.
The experimental results show that the proposed algorithm can process SLCA-based keyword search over large-scale XML data. However, some gaps remain: (1) the algorithm does not achieve complete separation between groups, and some ancestor relationships still exist; the reasons for this may be further analyzed in the next phase. (2) Limited by the lack of iteration support in Hadoop and other factors, the functional modules had to be written as separate blocks, and the work was conducted from the perspective of repeatedly starting Hadoop programs. (3) The parallelization of XML Dewey coding has not yet been fully implemented. In light of these gaps, emphasis will be placed on studying efficient keyword search algorithms and their applications in cloud computing environments.