An Improved Distributed Query for Large-Scale RDF Data

The rigid structure of the traditional relational database leads to data redundancy, which seriously affects query efficiency and prevents effective management of massive data. To address this problem, we use distributed storage and parallel computing technology to query RDF data. To achieve efficient storage and retrieval of large-scale RDF data, we combine the respective advantages of the relational storage model and distributed querying. To overcome the disadvantages of existing RDF storage and query schemes, we design and implement a breadth-first path search algorithm based on keyword queries on a distributed platform. We run LUBM query statements against the selected data sets and compare query response times under different conditions to evaluate the feasibility and correctness of our approach. The results show that the proposed scheme reduces storage cost and improves query efficiency.


Introduction
In previous studies, researchers used simple sampling for data analytics. The arrival of big data opens up a new era of data value [1], and large-scale data analysis has become a hot topic. The basic elements of a knowledge graph are interconnected entities and their attributes, which can be expressed as subject-predicate-object triples. Typically, RDF (Resource Description Framework) is used to represent and store this ternary relationship, and the efficient management of RDF data is at the center of knowledge graph and semantic web development. The traditional centralized relational database has a rigid, fixed structure. Because RDF data has no fixed schema, a relational store cannot guarantee low data redundancy, which seriously affects query efficiency and prevents effective management of massive data. For example, the query efficiency of the 3store system [2] and RDF-3X [3][4] is far lower than that of distributed big data queries. Therefore, to solve these problems, more and more solutions based on distributed systems are applied to data storage and querying, which is of great significance for managing explosively growing data. However, efficiently managing the growing volume of RDF data and conducting large-scale data analysis remains the focus of many scholars. This paper combines big data and knowledge graph techniques through distributed storage and computation, and proposes a breadth-first path search algorithm based on keyword queries on a distributed platform, addressing data extraction, fusion, storage, computation, and related problems.
This paper is organized as follows. Section 1 introduces the background and significance of the research and the basic concepts of RDF. Section 2 describes related work on RDF data storage and querying, including RDF, SPARQL, Hadoop, and HBase. Section 3 outlines the relational database storage model and the distributed storage model. Section 4 introduces the overall idea of structured SPARQL queries, analyzes the keyword query scheme in detail on the basis of the distributed RDF storage model, and gives the implementation of the breadth-first path search algorithm based on keyword queries. Section 5 verifies the correctness and efficiency of the scheme by experiments. Section 6 draws our conclusions.
Our contributions are as follows:
• We design an RDF data storage model based on a distributed system, which can be extended by adding cluster nodes.
• We propose a breadth-first search algorithm based on keyword queries, built on the HBase RDF storage and query framework, which realizes efficient parallel queries.
• We test on the LUBM datasets. By comparing the sample size of triples and query response time, we show that our query algorithm is more efficient than H2RDF in processing queries over huge datasets.

Related Work
How to store and query RDF data is the focus of many researchers when designing RDF data management systems. Currently, existing RDF data storage methods mainly include storage based on memory, relational databases, object-oriented databases, and distributed clusters [5]. The main methods to query RDF data include local queries, RDF graph queries on parallel platforms, and index queries. Research demonstrates that relational databases are less efficient than distributed databases at storing and querying massive data [6]. Therefore, more and more researchers have begun to use the massive storage and parallel computing of distributed clusters to query large-scale RDF datasets. For example, Nikolaos Papailiou proposed the H2RDF distributed system based on HBase and MapReduce in 2012 [7]. At the same time, some mature RDF management systems have been developed, such as Jena [8], Sesame [9], and RDF-3X [10][11][12]. Query methods based on distributed systems have gradually become the mainstream approach to large-scale RDF data queries.

RDF and SPARQL
The Resource Description Framework is an infrastructure specified by the W3C (World Wide Web Consortium) [13] for describing structured data and their interrelationships. It is mainly composed of a syntax specification and expression statements. The core data model consists of three object types: resources, properties, and statements. RDF datasets consist of resources identified by a unique URI and are intended to be machine-readable. As shown in Fig. 1, an RDF document can be used to describe "https://www.runoob.com//rdf". The triple structure <resource-relation-resource> can describe complex entities flexibly and easily, and provides a common framework for classifying and querying different metadata elements; RDF can describe Web resources, information about Web pages, and content for search engines. At present, research on RDF data management mainly addresses two questions: how to store RDF data effectively and how to query it efficiently. We implement SPARQL and other query statements on top of our RDF data storage model and further propose a breadth-first path search algorithm based on keyword queries. SPARQL (SPARQL Protocol and RDF Query Language) is a standard query language published by the W3C for querying RDF data [14]. Its basic format and syntax are similar to SQL, and most RDF data management systems currently support SPARQL queries. Most SPARQL queries contain a number of triple patterns, collectively called a basic graph pattern [15]. The triple pattern is the most basic matching unit, and its structure mirrors the RDF triple. A SPARQL query basically proceeds in three steps: first, the basic graph pattern is constructed; second, it is matched against subgraphs that satisfy the pattern; finally, the results are bound to the corresponding variables. SPARQL also provides four different query forms: SELECT, ASK, DESCRIBE, and CONSTRUCT.
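The three steps above can be sketched in plain Python. This is a minimal illustration with made-up triples and helper names (`match_pattern`, `bgp_query` are ours, not part of any SPARQL engine): build the basic graph pattern, match each triple pattern, and accumulate variable bindings.

```python
# Minimal sketch of basic-graph-pattern matching over an in-memory triple set.
# Terms beginning with '?' are variables; everything else must match exactly.

def match_pattern(triples, pattern, binding=None):
    """Yield extended variable bindings for one triple pattern."""
    binding = binding or {}
    for t in triples:
        b = dict(binding)
        ok = True
        for term, value in zip(pattern, t):
            if term.startswith('?'):
                if term in b and b[term] != value:
                    ok = False
                    break
                b[term] = value           # bind the variable to this value
            elif term != value:
                ok = False
                break
        if ok:
            yield b

def bgp_query(triples, patterns):
    """Join all triple patterns of a basic graph pattern (nested-loop join)."""
    bindings = [{}]
    for p in patterns:
        bindings = [b2 for b in bindings for b2 in match_pattern(triples, p, b)]
    return bindings

triples = [
    ("ex:Alice", "ex:advisor", "ex:Bob"),
    ("ex:Bob", "rdf:type", "ub:FullProfessor"),
    ("ex:Alice", "rdf:type", "ub:GraduateStudent"),
]
# SELECT ?s WHERE { ?s ex:advisor ?p . ?p rdf:type ub:FullProfessor }
result = bgp_query(triples, [("?s", "ex:advisor", "?p"),
                             ("?p", "rdf:type", "ub:FullProfessor")])
print(result)  # [{'?s': 'ex:Alice', '?p': 'ex:Bob'}]
```

A production engine would of course use indexes rather than scanning all triples per pattern, but the binding semantics are the same.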
At present, implementing SPARQL queries over massive RDF data on a distributed platform is the main research direction, and a number of SPARQL-based extension tools such as C-SPARQL [16] and BimSPARQL [17] have emerged. In this paper, we query RDF data with SPARQL statements and test on LUBM, a standard benchmark data set, on a distributed Hadoop system.

Hadoop and HBase
Hadoop is an open-source distributed computing project that originated in Google's cluster systems. Because of Hadoop's low cost and high fault tolerance, researchers often develop and run applications for processing massive data on distributed platforms; for example, Ding et al. [18][19] mainly use a distributed system to realize large-scale data queries. The most important components of the Hadoop framework are the distributed file system HDFS and the distributed computing framework MapReduce. At present, RDF data queries on the distributed Hadoop platform rely on MapReduce for batch processing and parallel computation. MapReduce divides job processing into four stages: sharding, Map, sorting, and Reduce. Fig. 2 shows the execution flow of MapReduce. In recent years, much research has used the MapReduce framework for RDF data queries, e.g., [20][21]. These query systems are flexible and scalable, but they struggle to provide interactive queries because the MapReduce framework is limited to a fixed (key, value) input format. To address these problems, this paper focuses on RDF storage and querying, analyzes and compares storage container choices for the storage system, and proposes a BFS-based path algorithm for query retrieval. HBase is a distributed database based on Google's Bigtable model [22], which can process massive structured or unstructured data and provide random, real-time read-write access to large-scale data [23]. Compared with a traditional relational database, HBase locates a value by row key, column name, and timestamp; a query only needs to scan the relevant columns rather than whole rows, which greatly improves query efficiency and performance. However, HBase gives up the atomicity, consistency, isolation, and durability (ACID) guarantees of traditional relational databases. Fig. 3 shows a storage schema instance of an HBase table. Myung et al. [24] store RDF triples directly in HDFS and propose greedy selection and two join-selection strategies to increase the efficiency of MapReduce; this approach is inefficient because there is no index. Therefore, in this paper, we design an RDF data storage model with HBase. According to the ontology described in OWL, we create two index tables for each class to store the class and property information in the ontology. In addition, the characteristics of column storage and sparsity in HBase solve the problem of null and multi-valued attributes that plagues traditional relational attribute-table storage.
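The index-table idea can be illustrated with plain dictionaries standing in for HBase tables. This is a sketch under our own assumptions, not the paper's exact schema: row key = subject (or object), column qualifier = predicate, and cells hold sets, so multi-valued attributes cost one extra cell and absent attributes cost nothing, mirroring HBase's sparse column storage.

```python
# Toy stand-in for two HBase index tables over RDF triples:
#   s_index: row key = subject, column = predicate, cell = set of objects
#   o_index: row key = object,  column = predicate, cell = set of subjects
from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.s_index = defaultdict(lambda: defaultdict(set))
        self.o_index = defaultdict(lambda: defaultdict(set))

    def put(self, s, p, o):
        self.s_index[s][p].add(o)   # multi-valued attributes just add a cell
        self.o_index[o][p].add(s)   # reverse index for object-bound lookups

    def objects(self, s, p):
        return self.s_index[s][p]   # scans only the relevant column

    def subjects(self, p, o):
        return self.o_index[o][p]

store = TripleStore()
store.put("ex:Alice", "ub:memberOf", "ex:Dept0")
store.put("ex:Carol", "ub:memberOf", "ex:Dept0")
print(store.subjects("ub:memberOf", "ex:Dept0"))  # {'ex:Alice', 'ex:Carol'} (set order varies)
```

In real HBase, each lookup above would be a `Get`/`Scan` restricted to one column family, which is what makes object- and subject-bound triple patterns cheap.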

Storing RDF
RDF data storage models mainly include relational database storage and distributed storage. This section compares the two. Owing to the poor scalability of relational database storage, we adopt a distributed clustering approach to store RDF data in the experiments of Section 5.

Relational Database Storage
Using a relational database to store RDF data is a common approach, widely used in various kinds of systems [25]. Relational storage is simple to operate, easy to understand, and structurally complete, but scales poorly. It includes four layouts: horizontal, vertical, decomposition, and hybrid. Current relational RDF stores struggle to meet the need for efficient management of large-scale RDF data and cannot efficiently handle complex queries. The inherent mismatch between the relational and RDF models further impedes the development of such storage, so more and more scholars have begun to use parallel or distributed storage management.
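Two of the layouts named above can be contrasted with a toy example (illustrative triples, not the paper's data): a horizontal store keeps one wide triple table that every query must scan, while a vertical (property-table) store keeps one two-column table per predicate, so a query bound on a predicate touches only its own table.

```python
# Horizontal vs. vertical relational layouts for the same toy triples.
from collections import defaultdict

triples = [
    ("Alice", "type", "Student"),
    ("Alice", "advisor", "Bob"),
    ("Bob", "type", "Professor"),
]

# Horizontal: a single triple table; queries scan all rows.
triple_table = list(triples)

# Vertical: one (subject, object) table per predicate.
property_tables = defaultdict(list)
for s, p, o in triples:
    property_tables[p].append((s, o))

print(property_tables["type"])  # [('Alice', 'Student'), ('Bob', 'Professor')]
```

The vertical layout also shows where the null-value problem comes from: in a single wide attribute table, every subject would need a column for every predicate, most of them empty.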

Distributed Storage
Due to the increasing amount of RDF data, the traditional single-node storage method is no longer suitable for massive data, so researchers have begun to study storage methods based on distributed clusters. In 2010, Sun et al. proposed a distributed storage method using index tables [26]. In 2013, Papailiou et al. proposed the H2RDF storage scheme, which orders triple patterns for querying according to a greedy strategy [27]. Distributed storage methods can effectively improve data loading speed and reasoning-query efficiency while preserving the correctness of the stored data. The HBase-based distributed storage process has five stages: uploading the RDF data set to HDFS, preprocessing and parsing the data set into triples, encoding the strings, loading the RDF data into HBase, and finally designing a reasonable table structure to store the data. Fig. 4 shows an example of abstract RDF storage. HBase stores data by column; distributed storage therefore improves the scalability of data management and is the most common method for handling large-scale data sets.
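The string-encoding stage of the pipeline above is typically dictionary encoding: long URIs and literals are replaced by compact integer IDs before loading, shrinking both cell values and row keys. The sketch below uses an in-memory dict as a stand-in for the (normally distributed) ID table; the class name and example terms are ours.

```python
# Dictionary encoding for RDF terms: each distinct string gets a small
# integer ID; triples are stored as ID tuples and decoded on output.

class Dictionary:
    def __init__(self):
        self.str2id = {}   # term  -> id
        self.id2str = []   # id    -> term

    def encode(self, term):
        if term not in self.str2id:
            self.str2id[term] = len(self.id2str)
            self.id2str.append(term)
        return self.str2id[term]

    def decode(self, i):
        return self.id2str[i]

d = Dictionary()
triple = ("http://example.org/Alice", "ub:advisor", "http://example.org/Bob")
encoded = tuple(d.encode(t) for t in triple)
print(encoded)               # (0, 1, 2)
print(d.decode(encoded[2]))  # http://example.org/Bob
```

On a cluster, this mapping is usually built in a MapReduce pass so that all nodes agree on the same IDs before the load stage.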

SPARQL Query
SPARQL [28] is a structured language for querying RDF data, and a large number of researchers use it for structured queries in distributed environments. A SPARQL query is similar to a SELECT statement in SQL; Fig. 5 is a simple example. SPARQL query statements contain triple matching patterns, and the corresponding subgraph matching pattern is shown in Fig. 6. In addition to SELECT queries, SPARQL includes keywords such as DISTINCT, LIMIT, and ORDER BY, which further improve query expressiveness. SPARQL query methods mainly include triple pattern queries, basic graph pattern queries, and complex queries. A common SPARQL query consists of multiple triple patterns, each of the form <subject-predicate-object>, and RDF querying is mainly based on the basic graph pattern. For example, in 2009, Zhu [29] demonstrated that graph-structured storage of RDF data avoids reconstruction and allows direct mapping. However, because graph databases do not provide a SPARQL query interface, researchers must perform complex language conversion when searching, which limits the development of SPARQL to some extent. The query efficiency of retrieval methods based on the RDF graph is generally low, so keyword-based query methods have become a hot research topic in graph querying. At present, there are two main approaches to querying RDF graphs: one queries each subgraph through a graph partitioning strategy; the other maps the RDF graph to a key summary graph and obtains the query results by path queries on that summary graph. Keyword queries on RDF datasets proceed in two steps: first, the mapping between the keywords and the RDF graph structure is established; second, a breadth-first search is applied to the RDF elements in that mapping.
The result is a query subgraph containing all elements that match the keyword requirements. On the basis of the structured query, this paper analyzes and improves the keyword query method and proposes a breadth-first path search algorithm based on keyword queries. The algorithm has two main steps. First, paths beginning with category Ci are stored in the Start[n] array, paths ending with category Ci are stored in the End[n] array, and all paths of the entity classes are initialized. Second, a breadth-first path search is carried out on the input paths. Finally, the keyword query is mapped to a structured query. Algorithm 1 shows the pseudo-code of the keyword-based breadth-first search. Extensive experiments show that the keyword query method achieves wider coverage and higher query efficiency. The breadth-first algorithm processes classes first in, first out: after each class Ci is dequeued, all of its adjacent classes are visited. Queue operations take O(n) time, the breadth-first module takes O(n+e), and the union module takes O(n²), so the overall complexity of our BFS algorithm is O(n²+n+e), that is, O(n²).
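The breadth-first expansion at the heart of the algorithm can be sketched as follows. This is an illustration under our own assumptions (a toy class-adjacency graph with invented LUBM-like class names), not a reproduction of Algorithm 1: starting from a class matched by one keyword, BFS expands paths level by level until it reaches the class matched by another keyword, yielding a shortest connecting path.

```python
# Breadth-first path search between two keyword-matched classes in a
# class-adjacency graph. FIFO queue => first path found is a shortest one.
from collections import deque

def bfs_paths(adj, start, goal):
    """Return one shortest class path from `start` to `goal`, or None."""
    queue = deque([[start]])        # queue of partial paths (first in, first out)
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adj.get(node, ()):   # visit all adjacent classes of node
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy class graph with LUBM-flavored names (illustrative edges only)
adj = {
    "GraduateStudent": ["Course", "University"],
    "Course": ["Department"],
    "University": ["Department"],
    "Department": ["Professor"],
}
print(bfs_paths(adj, "GraduateStudent", "Professor"))
# ['GraduateStudent', 'Course', 'Department', 'Professor']
```

Each node and edge is touched at most once, which is where the O(n+e) cost of the breadth-first module comes from; unioning the paths found for all keyword pairs is the O(n²) step.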

Experiments
In this section, we evaluate the storage model and the retrieval algorithm presented above. The experiment runs in a distributed cluster environment configured with an Intel(R) Core(TM) i7 2.60 GHz CPU, 8 GB of memory, and 500 GB of hard disk space. The development environment is Java SE 1.8, Hadoop 2.7.3, and HBase 1.4.9. The experimental data set is the LUBM standard test set. We verify our scheme with different LUBM query methods, select the running time as the evaluation index, analyze the change in query performance of the keyword query algorithm before and after optimization, and compare it with the traditional algorithm.

Datasets
This experiment uses the LUBM (Lehigh University Benchmark) [30] test set. LUBM provides data sets of different sizes and 14 standard query statements. Tab. 1 gives a characteristic description of the queries. In the experiment, we use the UBA data generator to generate six RDF data sets of different sizes, where Di denotes a data set covering i universities; the selected data sets correspond to 1, 5, 10, 50, 100, and 200 universities respectively, as shown in Tab. 2. The generated data is stored in HDFS and then loaded into each class's storage table with a MapReduce job. Since some LUBM query statements are designed to test the reasoning ability of the engine, we select typical statements for testing.

Evaluation
We run LUBM's three typical query statements Q1, Q5, and Q6 on the six datasets and compare our scheme with the traditional H2RDF system. The average response time is the mean execution time of each query over 5 runs on each data set. The response times of the LUBM query statements are shown in Tab. 3, in milliseconds. The results show that for the simple query Q1, our scheme has no obvious advantage in response time over H2RDF, and the response time varies little as the data sets multiply. For Q5, which has low selectivity but a complex reasoning relationship, the response time increases faster as the data sets grow. For the highly selective Q6, the response time is the longest and grows the fastest. Compared with the H2RDF system, both the proposed distributed storage model and the keyword-based breadth-first path search prove efficient and feasible.

Conclusion
Traditional data storage and structured querying suffer long execution and slow response times in the distributed scenario, and cannot meet the requirements of querying massive RDF data. We analyze and compare relational database storage with distributed storage, and develop an improved distributed storage scheme based on HBase. Furthermore, we compare the structured SPARQL query with the keyword query and propose a breadth-first path search strategy. Experiments on the LUBM dataset indicate the feasibility and efficiency of this scheme by comparing average response time and query efficiency. Our future work will further investigate the keyword query method and the mapping between keywords and RDF elements; the conversion rate and accuracy of keyword mapping still need to be improved.

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding this paper.