VEDAS: an efficient GPU alternative for store and query of large RDF data sets

Resource Description Framework (RDF) is commonly used as a standard for data interchange on the web. A collection of RDF data sets can form a large graph that is time-consuming to query. It is known that modern Graphics Processing Units (GPUs) can be employed to execute parallel programs in order to reduce the running time. In this paper, we propose a novel RDF data representation along with a query processing algorithm that is suitable for GPU processing. Since the main challenges of the GPU architecture are the limited memory size, the memory transfer latency, and the vast number of GPU cores, our system is designed to strengthen the use of GPU cores and reduce the effect of memory transfer. We propose a representation consisting of indices and column-based RDF ID data that can reduce the GPU memory requirement. The indexing and pre-upload filtering techniques are then applied to reduce the data transfer between the host and GPU memory. We add an index swapping process to facilitate sorting and joining the data based on a given variable, and add a pre-upload step to reduce the size of the result storage and the data transfer time. The experimental results show that our representation is about 35% smaller than the traditional NT format and 40% smaller than that of gStore. The query processing speedup ranges from 1.95 to 397.03 when compared with the RDF-3X and gStore processing times on the WatDiv test suite. It achieves speedups of 578.57 and 62.97 on the LUBM benchmark when compared to RDF-3X and gStore, respectively. The analysis shows the query cases that benefit from our approach.

In 2008, the World Wide Web Consortium released the standard query language for RDF, called SPARQL Protocol and RDF Query Language (SPARQL). It is a query language similar to SQL in a traditional database, with support for basic query operations such as filtering, join, projection, and sorting, but it is also capable of querying across RDF data on the network wherever endpoints are available.
To efficiently retrieve query results from large RDF data, we need an RDF processing platform that can perform a SPARQL query in an acceptable time. A high-performance and cost-efficient hardware accelerator such as the GPU is one of the platform solutions. Nevertheless, building applications for the GPU poses some challenges: (1) The GPU has limited resources, while RDF dumps are large text files. (2) The transfer latency between the host and the GPU can degrade the speedup gained. Normally, GPU processing requires all data to be kept in the GPU memory. (3) The number of threads in GPUs is large. Utilizing them at the same time can increase the processing speedup.
In this work, we propose a framework based on the TripleID representation [6]. To make it fit inside the GPU memory, we compress the representation by transforming it into a column format with column indices. This can save a lot of memory since the RDF data is usually sparse. Next, we propose algorithms for primitive SPARQL query operations such as select and join on our new representation. We also propose a pre-upload filtering technique that can reduce the data transfer between the host and GPU memory. In the experiments, we compare our column-based representation with the compressed exhaustive indices representation in RDF-3X and the graph representation in gStore, in terms of size and query processing time. The results are promising, showing a decrease in both query time and storage size. In short, VEDAS has the following benefits.
(1) The representation is based on TripleID, which can save up to 65% of the storage size compared to the N-Triples format. The representation considers proper indices to allow fast tuple querying. (2) It provides support for basic query operations that properly account for the GPU resources, e.g., the limited GPU memory, the transfer overhead, and the use of massively parallel threads. (3) It works well for queries that contain many join operations. For example, in the experiments, query type C yields a speedup of up to 284 compared to gStore and 13.09 compared to RDF-3X. These join operations lead to a large thread workload that significantly hides the transfer time.
The structure of this paper is as follows. Section "Background and related work" explains the related work and the inspiration of our work. Section "VEDAS framework and operations" presents our representation and the proposed processing algorithms based on the GPU. The experiments, comparison results, and analysis are described in Section "Experiments". Then, Section "Extension to other operations" discusses the extension to cover complex query types. Finally, Section "Conclusion and future work" concludes the work and discusses the future implementation.

Background and Related work
In this section, we first present the preliminary knowledge on Resource Description Framework and SPARQL. The background on GPU processing is also included. Next, we highlight the related work in RDF representation and query processing areas.

Resource Description Framework (RDF)
There are many representations for Resource Description Framework (RDF) data, such as N-Triples, N3, N-Quads, RDF/XML, and Turtle. The simplest and most popular representation is N-Triples, where each statement (or each line) contains a triple of the form subject, predicate, object, and the predicate expresses the relation between the subject and the object. Each term (subject, predicate, or object) can be any IRI string. An example of the N-Triples representation is shown in Fig. 1.

The triple implies that Air is a subclass of AbioticEntity, based on the RDFS vocabulary. <http://www.owl-ontologies.com/BiodiversityOntologyFull.owl#Air> is the subject, <http://www.w3.org/2000/01/rdf-schema#subClassOf> is the predicate, and <http://www.owl-ontologies.com/BiodiversityOntologyFull.owl#AbioticEntity> is the object. These terms are IRIs and are obtained from a biomedical ontology [7]. The above N-Triples can be converted into RDF/XML as in Fig. 2.
Another interpretation of the RDF data is a directed labeled multigraph. Subjects and objects are vertices in the graph, and predicates are edges that connect their corresponding subjects and objects, as shown in Fig. 3. The graph is shown in Fig. 4. The example implies that Bob is male, and he knows Alice. Alice's birthday is May 20, 1997. Alice is the founder of YumYum restaurant that has fried rice on the menu.
SPARQL is a query language that is commonly used for RDF data. A SPARQL SELECT statement is analogous to the SQL SELECT statement. Based on N-Triples, the query can select subjects, predicates, and/or objects of the triples. As in SQL, a query can contain subqueries. In Listing 1, there are two subqueries: (1) find the journals with the title The Journal of Supercomputing and (2) find all authors from the above journals. In the query, ?authors is a variable whose values are the final answers for the SELECT statement. dc is an abbreviation prefix of <http://purl.org/dc/elements/1.1/>, which is a standard vocabulary resource from Dublin Core [8].
In the big data era, RDF data is popular since it is a kind of NoSQL data that carries information linkage, and with the trend of data governance, such a standardized form is encouraged. The RDF data size is rapidly growing. Examples of large data sets include GeoSpatials (1.888M triples), U.S. Census data (1 billion triples), World Bank Linked data (160 million triples), and DBpedia (247 million triples) [9].
One of the challenges in this domain is to retrieve and process them efficiently. SPARQL is a de facto standard for querying RDF data. The syntax is similar to SQL in the relational database. For example, "SELECT ?x ?y WHERE { ?x founder ?y . ?y isA Restaurant . }" is a query that lists all pairs of a restaurant founder's name and the restaurant's name. In Fig. 3, the result for ?x is Alice and for ?y is YumYum. Evaluating a SPARQL query amounts to sub-graph matching in the RDF graph. For more complex queries, SPARQL has modifiers to refine the query, for example, LIMIT for limiting the number of results, FILTER for filtering results with boolean conditions, and UNION for combining results. Our work first focuses on the basic query that contains SELECT and WHERE, but we provide a guideline for applying the implementation to some important modifiers in Section "Extension to other operations".
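To make the sub-graph matching view concrete, the following minimal Python sketch evaluates the example query over a toy triple set. The triples are an assumed reconstruction of the graph described for Fig. 3, the predicate names `gender`, `birthday`, and `hasMenu` are hypothetical, and the helpers `match_pattern` and `evaluate` are illustrative only; this naive nested-loop matcher is not the evaluation strategy used by VEDAS.

```python
# Toy triples assumed from the textual description of the example graph.
TRIPLES = [
    ("Bob", "gender", "male"),
    ("Bob", "knows", "Alice"),
    ("Alice", "birthday", "1997-05-20"),
    ("Alice", "founder", "YumYum"),
    ("YumYum", "isA", "Restaurant"),
    ("YumYum", "hasMenu", "FriedRice"),
]

def is_var(term):
    """A term is a free variable if it starts with '?'."""
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on failure."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding

def evaluate(patterns, triples):
    """Evaluate a list of triple patterns by naive nested-loop matching."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings

query = [("?x", "founder", "?y"), ("?y", "isA", "Restaurant")]
results = evaluate(query, TRIPLES)
print(results)
```

Each triple pattern narrows the set of variable bindings, which is the same effect that the relational joins discussed later achieve on sorted id columns.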

Graphics Processing Unit (GPU)
In the past, the GPU was used to accelerate graphics applications such as games. Currently, many other applications utilize GPUs to improve performance because of their large number of parallel processing units.
The GPU is a Single Instruction Multiple Data (SIMD) architecture that can process multiple data simultaneously, with its thousands of cores running the same instruction. Its architecture groups multiple processing units into multiple Streaming Multiprocessors (SMs). The GPU has thousands of threads executing on these SMs, which are controlled by the scheduler.
The GPU is located inside a computer (called the host), which also has a CPU and main memory. The GPU has its own memory space that is separate from the host memory. Based on the latest GPU technology, it can have a maximum of 32 GB of memory per card, which is small compared to the size of the host memory, which can be enlarged to hundreds or thousands of gigabytes. In addition, the GPU has a hierarchical memory layout consisting of registers, local memory, shared memory, and global memory. The global memory has the largest size, whereas registers are the fastest memory. Each thread has its own registers and local memory. A group of threads (called a thread block) can access the same shared memory and is executed on the same SM. The global memory is accessible by all GPU threads.
To use the GPU for computation, the data must be transferred from the host memory to the GPU global memory. The transfer latency is one of the main overheads incurred in GPU processing. Once the data resides in the GPU memory, the GPU can start its execution. Thus, to maximize the application performance on GPUs, the following are the common considerations.
(1) Reduce the size of the data transferred between the CPU and GPU. (2) Hide the memory transfer latency by overlapping processing time and memory transfer time. (3) Maximize the parallelism among all the threads. (4) Optimize the GPU memory usage, for example by using shared memory instead of global memory to share data among threads, exploiting locality, and adjusting the thread memory access pattern to reduce global memory transfers.
In our work, we are interested in utilizing the GPU to improve the query performance for RDF data. Due to the above constraints of the GPU, we develop a compact RDF representation and introduce a query processing framework that is suitable for GPU processing. The framework contains three basic operations (pre-upload filter, index swapping, and parallel merge-join), which optimize the transfer time and enable GPU parallelism. The framework will be scaled up to support multiple GPUs and a cluster in the near future.

Related works
There are various works on RDF stores and query processing. We highlight the two subareas which are most related to us: RDF representation and parallel query processing.

RDF representation
RDF data stores can be categorized into 3 classes: relational, graph, and matrix representations. The relational approach has been around for a long time [10]. It treats each RDF triple as a row in a table of a relational database. A benefit of this approach is that it allows the user to manipulate the data just as in a relational database [11][12][13].
In [14], the SQL query was designed to run on a distributed system. QRDF [15] stores the RDF graph in an edge-list style and uses a red-black tree data structure as an update mechanism. Because RDF data is normally large, indexing the triples is important to make querying efficient [16]. However, this approach needs high computation power when handling a large amount of related data, and the join operation is the bottleneck of the system. Another natural approach is to use a graph representation, which shows the relationships among data. gStore [17] is one example that stores the RDF data in this representation. Accordingly, a SPARQL query is also represented as a graph, and a sub-graph matching algorithm is used to find the query result in such a representation. gStore has a VS*-tree that contains the indices of the data, making the matching process faster [18]. Even though the graph approach handles relations more naturally, it has a scaling problem on systems with limited shared memory [19,20]. Moreover, the irregular access pattern makes it difficult to implement effectively on the GPU in a way that utilizes many threads and the GPU memory. Qi et al. [21] proposed a dual-store approach that combines the advantages of relational and graph structures. The idea is to store the whole data in the relational database and to answer complex queries with the graph database. The graph representation is more suitable for performing complex queries on large datasets.
The matrix representation is an alternative approach that makes it easy to compress the data and create indices. Yuan et al. proposed TripleBit, which stores RDF data in a bit matrix [22]. MAGiQ stores RDF in a sparse matrix and proposes a matrix algebra for querying the data [23]. The SPARQL query is converted to an equivalent matrix algebra expression, and existing matrix algebra libraries (MATLAB and GraphBLAS) are utilized to process it. gSMat also stores RDF as a sparse matrix and translates the join operation into sparse matrix multiplication [24]. Because matrix multiplication is one of the basic GPU operators, that work implements the join operator on both the CPU and the GPU (using CUDA). Tentris [25] uses sparse order-3 tensors to store the RDF graph. That work also uses a trie-like data structure called a hypertrie to support the tensor slice operation, and defines a new tensor operation for solving SPARQL queries. Table 1 shows the summary of each representation. All representations require indices to rapidly access the triples. To compress the data, most works replace the RDF terms with unique IDs or bits.
Besides the data representation, query processing and the join operator are also important. MapSQ handles the SPARQL query by using the MapReduce framework in joining [26]. SMJoin utilizes the multi-way join algorithm to reduce the network cost and processing time [27].

Distributed SPARQL query
Some work has considered a distributed approach for SPARQL query processing. Feng et al. [28] classified distributed RDF systems into 3 classes: (1) those based on an existing general distributed computing framework such as Hadoop or Spark [29][30][31], (2) those based on a partitioning method [14,32], and (3) federated systems that integrate multiple systems into one virtual system [33][34][35]. The classification is based on the storage types (partition, graph, and DBMS) and two query execution strategies (partition and DBMS). The paper indicates that architecture, storage, and query are key factors for SPARQL query performance. For example, TriAD is based on partition, gStoreD is based on graph, and S2RDF is based on the DBMS. The partition approach (TriAD) seems to outperform the others.
Peng et al. evaluated a SPARQL query using a distributed scheme [32]. In this work, the authors used the partial evaluation and the assembly framework. The authors modeled RDF data as a graph as well as the query. They proposed an algorithm to find a local partial match as partial answers in each fragment from the RDF graph. WatDiv, LUBM and BTC were used as benchmarks for performance measurement. The experiments also compared various cases: a large number of triples, varying intermediate results, each stage performance, partitioning strategy, in-memory operations, etc.
In TriAD, the authors proposed the asynchronous shared-nothing message passing architecture for SPARQL query processing [14]. The approach partitions the RDF graph and distributes the portions. METIS was used for graph partitioning. The SPARQL query was also transformed into a graph and the bindings between free variables and RDF entities are created. The queries were executed in a distributed fashion with a global plan. The LUBM, BTC and WSDTS benchmarks were used for benchmarking.
The literature shows that, for distributed systems, the aspects that most strongly affect query efficiency are the architecture and the query type. Both of them affect the size of intermediate join results, the join plan of subquery operations, the data partitioning strategy, the matching algorithms, etc.

SPARQL optimization on GPU
The GPU can be used to match the results of each subquery and to join them. To process a subquery, the data for that query must be in the GPU memory. Transferring the data to the GPU memory usually incurs significant overhead. The performance of a SPARQL query on the GPU highly depends on the representation, which affects the total transfer size and the join algorithm. A join algorithm that is suitable for the GPU and a good query planner are the keys to increasing query performance. MapSQ [26] uses the MapReduce technique on the GPU to increase the processing speed. The authors divided the answer processing into 2 steps: (1) finding the subquery results using gStore, and (2) merging the results of subqueries by using the proposed MapReduce-based join algorithm implemented on the GPU. They used the LUBM benchmark to measure the performance and compared the results against gStore and gStoreD. The speedup gained ranged from 1.15 to 2.05. SRSPG [29] is similar to MapSQ, but it implements a parallel join algorithm on Apache Spark, which is executed on the GPU.
Among the matrix-based approaches, MAGiQ [23] leverages MATLAB-GPU and the SuiteSparse package for execution on the GPU. gSMat [24] also uses a sparse matrix based representation and implements its own GPU join algorithm called SM-based join. gSMat gained a speedup over RDF-3X and gStore ranging from 1.87 to 16.13 times across various query types on the WatDiv 500M benchmark. By relying on a sparse matrix library, the query engine benefits from the optimizations in new library versions. However, an ordinary matrix does not support multigraphs, which are the nature of RDF data, and it is complicated to handle the advanced forms of SPARQL. gSmart [36], another sparse matrix based representation, focuses on heterogeneous architectures. It can execute on the CPU and GPU using hierarchical partitioning based on the incident edges. This work is close to MAGiQ but implements the system from scratch instead of using third-party libraries.
TripleID-Q [6] relies on the relational row-based format to represent the RDF data and converts each triple into integer IDs to compact the data. The sub-result triples are simultaneously marked by GPU threads. The results are joined with the merge-join approach. However, the work does not consider a query planner or query optimization. Table 2 compares the previous works in RDF processing using GPUs.

VEDAS framework and operations
Due to the constraints of the GPU architecture, the main design goals are to minimize memory usage and speed up the query processing time. Figure 5 shows the components of our framework, which consists of three parts: (1) data storage and representation, (2) data loader, and (3) query processor.
Data storage is where the converted RDF data is kept on the host side. It contains the proposed representation designed for GPU processing. The storage refers to the disk storage where the N-triple data are first kept.
The data loader contains two subcomponents: a parser and an indexer. The parser performs the syntax parsing of the N-Triples data, and the indexer transforms the N-Triples data into a dictionary and indices. The resulting Triple-ID format with indices facilitates the search process.
From a SPARQL query, the query processing is done in the query processor. It contains the parser, which parses the SPARQL query and outputs an internal format of query operations. Next, the query planner will find the optimal order of query operations. Finally, the query executor utilizes the query plan obtained from the query planner and applies it accordingly.

Data representation
The N-Triple format contains a string datatype as a basic element (such as IRI). Importing such a large number of triples directly to the GPU is not appropriate since it occupies lots of memory and induces large GPU-CPU memory transfer. Our representation converts the string data into the 4-byte integer (called id). In particular, this step uses a hash function for encoding. In our case, we represent a triple with 12 bytes of memory (4 bytes for a subject, 4 bytes for a predicate, and another 4 bytes for an object). The mapping between an IRI string and the unique integer is saved in a dictionary on the host memory.
Each N-Triple statement contains three terms: subject (S), predicate (P), and object (O). Each S, P, and O is converted into a unique id, recorded in a dictionary. Thus, the triple statement becomes a Triple-ID, $tp = \langle id_1, id_2, id_3 \rangle$, where $id_1$ is the id of the associated subject (S), $id_2$ is the id of the associated predicate (P), and $id_3$ is the id of the associated object (O). In general, the intermediate results after performing more than one subquery in a sequence may contain a different number of ids. We denote by $t$ a tuple of $m$ ids, where $m$ is the number of ids in the tuple. Figure 7a shows the triples after being sorted by SOP to create the column-based representation in Fig. 7b.
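The encoding step and the column split can be illustrated with a minimal Python sketch. Several details here are assumptions made for illustration: ids are assigned sequentially rather than by the hash function mentioned above, the index is a plain map from each distinct leading id to its row range, and the names `encode` and `to_columns` are hypothetical. The exact column and index layout of Fig. 7b is not reproduced.

```python
from array import array

def encode(triples):
    """Map each distinct term to an integer id; return (dictionary, id triples).

    Assumption: ids are assigned sequentially starting at 1.
    """
    dictionary = {}
    encoded = []
    for triple in triples:
        encoded.append(tuple(dictionary.setdefault(term, len(dictionary) + 1)
                             for term in triple))
    return dictionary, encoded

def to_columns(id_triples, order=(0, 2, 1)):
    """Sort id triples by the given term order (default S, O, P) and split into columns.

    Returns (index, col2, col3). `index` maps each distinct leading id to the
    half-open row range [start, end) it occupies, so the leading column itself
    does not need to be stored row by row.
    """
    rows = sorted(tuple(t[i] for i in order) for t in id_triples)
    index = {}
    for row_no, row in enumerate(rows):
        start, _ = index.get(row[0], (row_no, row_no))
        index[row[0]] = (start, row_no + 1)
    # array("I") holds unsigned ints, typically 4 bytes each.
    col2 = array("I", (r[1] for r in rows))
    col3 = array("I", (r[2] for r in rows))
    return index, col2, col3

dictionary, id_triples = encode([
    ("Bob", "knows", "Alice"),
    ("Alice", "founder", "YumYum"),
    ("YumYum", "isA", "Restaurant"),
])
index, objects, predicates = to_columns(id_triples)
```

Storing a row range per distinct leading id instead of repeating the id on every row is one way a column layout can shrink sparse data, since a subject that appears in many triples is recorded only once.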

Data loader
The data loader is responsible for converting the triple data in the N-Triple format to our Triple-ID format. It also constructs dictionary and complete indices.
For the implementation, the Redland Raptor 2.2 library was used to parse N-Triples files. After that, an integer id is assigned to each term, and all triple statements are converted to the Triple-ID format. To create the permutation indices, we employ the Thrust library [37] to sort all triples on GPUs in various ways, such as sorting by subject/object, subject/predicate, predicate/object, etc.
Since we need to sort the large dataset 6 times, this step incurs a large transfer of triple data to the GPU memory. However, this process is performed only once for each dataset, and the transformed data is stored for future use.
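The six sorts correspond to the six orderings of subject, predicate, and object. The sketch below builds them on the CPU with Python's built-in `sorted` as a stand-in for the GPU sort; the function name `build_permutation_indices` and the dictionary-of-lists layout are assumptions for illustration only.

```python
from itertools import permutations

def build_permutation_indices(id_triples):
    """Return {"SPO": [...], "SOP": [...], ...}.

    Each entry holds the id triples with their terms reordered according to
    the key and then sorted lexicographically in that order.
    """
    names = "SPO"
    indices = {}
    for order in permutations(range(3)):
        key = "".join(names[i] for i in order)
        indices[key] = sorted(tuple(t[i] for i in order) for t in id_triples)
    return indices

indices = build_permutation_indices([(1, 2, 3), (3, 4, 5)])
print(sorted(indices))
```

With all six orderings available, any combination of bound terms in a subquery forms a prefix of at least one index, so the matching rows are always contiguous.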

Query parser
Simple SPARQL queries may contain only one subquery. In Listing 2, there is only one subquery: Who knows Alice. The free variable is ?who which is the subject while the bounded variables are knows and Alice.
Some queries consist of more than one subquery, and there can be more than one free variable. In Listing 3, there are two subqueries: x is the founder of y, and y is a restaurant. The free variables are ?x and ?y, which are the subject and the object of the first subquery and the subject of the second subquery, respectively. The bounded variables are founder, isA, and Restaurant, which are predicate, predicate, and object, respectively.
Each subquery has a set of Triple-IDs that are matching results. The first subquery extracts the set of subject/object pairs that have the predicate founder. The second one returns the set of subjects that are a Restaurant. The relational join is used to combine the results, keeping only the rows of the first and second results that have the same ?y. For a SPARQL query that consists of more than 2 subqueries, the optimization may consider the order of joins to reduce the intermediate results that are inputs to the next join operation. We will discuss how to order them in the query planner section.
In our notation, a SPARQL query $Q$ consists of $l$ free variables and $k$ subqueries, $Q = (SV, SQ)$, where $SV = \{?x_1, ?x_2, \ldots, ?x_l\}$ and $SQ = \{sq_1, sq_2, \ldots, sq_k\}$. A real-life SPARQL query can contain any number of free variables and subqueries. In our case, we assume that a subquery $sq_i = \langle e_1, e_2, e_3 \rangle$ consists of 3 elements $e_1$, $e_2$, and $e_3$, each of which is a free variable or an id.
The subquery returns an intermediate result $R = (V, T)$, where $V$ is a list of free variables $?x_1, ?x_2, \ldots, ?x_m$ and $T$ is a list of tuples $t$ sorted by variable $?x_1$. To construct the intermediate result $R$ from a subquery, there are many ways to select which index to use. When a different index is used, the order of the free variables changes correspondingly. The proper index should be selected to reduce the overall processing time.
The query parser converts the given SPARQL query to an internal format. For the implementation, the open source Redland Rasqal library [38] is used to parse SPARQL queries. For a query $Q = (SV, SQ)$, we store the free variables in $SV$ and all subqueries in $SQ$ for use in the subsequent query planner and executor. For each subquery $sq_i$, the bounded variables are converted to ids. For example, the query in Listing 3 will have $SV = \{?x, ?y\}$ and $SQ = \{\langle ?x, 11, ?y \rangle, \langle ?y, 12, 6 \rangle\}$.
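As a small illustration of this internal format, the sketch below converts already-tokenized triple patterns into an (SV, SQ) pair. It does not use Rasqal; the function name `to_internal` and the list-of-tuples layout are hypothetical, while the ids 11, 12, and 6 are the ones from the example above.

```python
def to_internal(patterns, dictionary):
    """Convert triple patterns into (SV, SQ).

    Free variables (terms starting with '?') are kept as strings and
    collected in SV in order of first appearance; bound terms are replaced
    by their ids from `dictionary`.
    """
    sv, sq = [], []
    for pattern in patterns:
        encoded = []
        for term in pattern:
            if term.startswith("?"):
                if term not in sv:
                    sv.append(term)
                encoded.append(term)
            else:
                encoded.append(dictionary[term])
        sq.append(tuple(encoded))
    return sv, sq

# Ids taken from the example in the text; the dictionary itself is illustrative.
dictionary = {"founder": 11, "isA": 12, "Restaurant": 6}
sv, sq = to_internal(
    [("?x", "founder", "?y"), ("?y", "isA", "Restaurant")],
    dictionary,
)
print(sv, sq)
```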

Query planner
The query planner takes the query in a Triple-ID form obtained after parsing. It analyzes and creates a sequence of operations for execution. There are 3 basic operators used in VEDAS framework.
(1) Upload: The operator uploads the intermediate result $R_i$ from subquery $sq_j$ to the GPU memory. It also indicates the index of $?x$ used in subquery $sq_j$. (2) Join: The process that combines the results $R_i$ and $R_j$ into another result $R_k$. The number of columns of $R_k$ may be greater than the number of columns of $R_i$ and of $R_j$. (3) Index swap: We use the sort-merge join as the only join method, which requires that the first variables of $V_i$ and $V_j$ be the same. Index swap is an operator that changes the order of $V_i$ to prepare for the next join operation.
Our assumption is that the operators in all subqueries are processed sequentially and that the join operator is a binary join, not a multi-way join. The output of each operator forms an intermediate result $R_i$ if the operator is processed at step $i$. All operators have a cost. The cost of the upload operator is the transfer time from the host memory to the GPU memory. Joining and index swapping are processed on the GPU, and they are also costly compared with simple processing tasks. For a given query $Q$, the query planner creates the order of the above 3 operators to construct the final query result. The order in which the operators are processed directly affects the performance of the query. If the order is well arranged, the number of index swap operators can be reduced, which can increase the performance. Sometimes, however, it pays to increase the number of index swap operators on small intermediate results in order to decrease the number of join operators on a large data set. The query planning problem is known to be NP-hard.
In this paper, we use a manual static scheduler to manage the order of operations. The strategy is to interleave the upload and join operations whenever possible. The order of the upload and join operations is determined by the triple patterns of the SPARQL query. This approach constructs a good-enough left-deep plan for evaluating the framework.

Query executor
After obtaining the order of operators from the query planner, the query executor executes the sequence of operators and stores the intermediate results in the GPU memory. The final results are transferred from the GPU memory back to the host after all operators have finished. The query executor contains several subcomponents corresponding to the basic operators described in the previous section. First, the bounded variables in $B_{sq}$ are used to identify the indices to be used. For example, if $sq = \langle ?z, 4, 5 \rangle$, the datasets that are indexed by predicate/object ($D_{POS}$) and object/predicate ($D_{OPS}$) can be used. Because we store the triples in a column-oriented fashion, we can upload only the related columns instead of all columns. The columns to be uploaded are only the columns of free variables that are matched from the index. The resulting consecutive rows are selected based on the range LowerOffset to UpperOffset (Lines 2-4). In Line 3, LowerOffset and UpperOffset can be tightened based on the information from other triples or after joining the results. The technique for filtering these tuples is described in section "Pre-upload filtering".
For $sq_1$, if we use the index $D_{PSO}$, the data contains 2 columns and is sorted by the subject (the column of ?x). If $D_{POS}$ is used, the data also contains 2 columns but is sorted by the object (?y). In this case, $D_{POS}$ is selected because the results sorted by ?y can immediately be joined with the results from $sq_2$, which are also sorted by ?y. The subquery $sq_2$ can use either $D_{OPS}$ or $D_{OSP}$ and obtain the same results.
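The range selection can be sketched in Python as a binary search over a sorted index. This is a minimal CPU-side illustration under assumptions: the index is a sorted list of id tuples ordered as in the chosen permutation, the bound terms form a prefix of that order, and the names `select_range` and `select_free_columns` are hypothetical. The actual host-to-GPU copy is not shown.

```python
from bisect import bisect_left

def select_range(index_rows, bound_prefix):
    """Return (lower_offset, upper_offset) of rows starting with `bound_prefix`.

    `index_rows` is a sorted list of id tuples. A shorter tuple compares as
    smaller than any longer tuple it is a prefix of, so searching for the bare
    prefix finds the first matching row; appending +inf finds the end.
    """
    lower = bisect_left(index_rows, bound_prefix)
    upper = bisect_left(index_rows, bound_prefix + (float("inf"),))
    return lower, upper

def select_free_columns(index_rows, bound_prefix):
    """Keep only the free-variable columns of the matching consecutive rows."""
    lower, upper = select_range(index_rows, bound_prefix)
    n = len(bound_prefix)
    return [row[n:] for row in index_rows[lower:upper]]

# Hypothetical POS-ordered index: (predicate, object, subject).
d_pos = sorted([(4, 5, 9), (4, 5, 2), (4, 6, 1), (3, 5, 7)])
offsets = select_range(d_pos, (4, 5))
subjects = select_free_columns(d_pos, (4, 5))
print(offsets, subjects)
```

For the subquery with predicate 4 and object 5, only the subject column of two consecutive rows would need to be uploaded, and it is already sorted by the free variable.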

Join operator
The results of subqueries are usually joined together. Let $R_i \bowtie R_j$ denote the join of the intermediate results of operators $i$ and $j$, respectively. Let $R_k = R_i \bowtie R_j$ be the join result. $R_k$ is composed of $(V_k, T_k)$. Equations 1-2 show the new sizes of $V_k$ and $T_k$.

Fig. 8 Subquery results example
Let $\pi_i(T)$ be the data of the $i$-th column of $T$. For example, for $R = (V, T)$ with $|V| = s$, we denote the first column of $T$ by $\pi_1(T)$ and the last column by $\pi_s(T)$. Algorithm 2 presents the join of $R_i$ and $R_j$, where $R_i$ has $r$ free variables and $R_j$ has $s$ free variables. Line 1 combines all the variables. In Line 2, the system applies the inner join to the first columns of $T_i$ and $T_j$. The first column is always sorted. A modern GPU sort-merge join [39] is used for the inner join in this step. The join process also returns the indices of the rows whose first column matches a row in the other intermediate result. After joining, the number of rows of the result is $|T_k|$. We allocate memory on the GPU with size $|T_k| \times |V_k|$. Lines 3-4 collect the rows in $T_i$ and $T_j$ that correspond to the row indices of the inner join produced by Line 2. The data are merged into the new data tuples $T_k$. Line 5 updates the variables storing the bounds used in the pre-upload filtering phase. Figure 9 shows an example of the join operation. In Fig. 9a, $R_i$ has two free variables ?x and ?y, and in Fig. 9b, $R_j$ has three free variables ?x, ?z, and ?w. The resulting join $R_k$ has free variables ?x, ?y, ?z, and ?w, whose tuples are shown in Fig. 9c.
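The steps described for the join can be sketched sequentially in Python. This is a CPU-side illustration, not the GPU sort-merge join of [39]; it assumes the two inputs share only their first variable, and the name `merge_join` and the tuple-list layout are hypothetical. The bound update of Line 5 is omitted.

```python
def merge_join(r_i, r_j):
    """Join two intermediate results (V, T) on their first variable.

    Each T is a list of id tuples sorted by its first column.
    """
    (v_i, t_i), (v_j, t_j) = r_i, r_j
    if v_i[0] != v_j[0]:
        raise ValueError("first variables differ; an index swap is needed first")
    v_k = v_i + v_j[1:]                      # combine the variable lists
    pairs = []                               # matching (row in T_i, row in T_j)
    a = b = 0
    while a < len(t_i) and b < len(t_j):
        key_a, key_b = t_i[a][0], t_j[b][0]
        if key_a < key_b:
            a += 1
        elif key_a > key_b:
            b += 1
        else:
            # Find the runs of equal keys on both sides and pair them up.
            a_end = a
            while a_end < len(t_i) and t_i[a_end][0] == key_a:
                a_end += 1
            b_end = b
            while b_end < len(t_j) and t_j[b_end][0] == key_a:
                b_end += 1
            pairs.extend((x, y) for x in range(a, a_end) for y in range(b, b_end))
            a, b = a_end, b_end
    # Gather the matched rows; the shared first column is kept only once.
    t_k = [t_i[x] + t_j[y][1:] for x, y in pairs]
    return v_k, t_k

r_i = (["?x", "?y"], [(1, 10), (2, 20), (2, 21), (4, 40)])
r_j = (["?x", "?z", "?w"], [(2, 7, 8), (3, 5, 6), (4, 1, 2)])
v_k, t_k = merge_join(r_i, r_j)
print(v_k, t_k)
```

Separating the index-pair computation from the gather step mirrors the description above: once the matching row indices are known, the output size is known and the output rows can be filled independently of each other.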

Pre-upload filtering
To reduce the number of tuples to upload to the GPU memory, we bound the number of results before uploading. This preliminary filtering is called the Pre-Upload Filtering phase. Suppose we have 2 intermediate results $R_i$ and $R_j$, and let ?w be the first free variable in $T_i$. Suppose the id of ?w ranges from 1052 to 2654 in $R_i$ and from 1548 to 3654 in $R_j$. We keep the bound of minimum and maximum id as (1052, 2654) for $R_i$ and (1548, 3654) for $R_j$. A tuple $t_i \in T_i$ or $t_j \in T_j$ whose id for ?w lies outside the overlap of the two ranges, i.e., is less than 1548 or greater than 2654, will not be considered in the resulting set and is discarded.

Fig. 9 Join example
To keep the bound for each variable, we construct a dictionary that stores the boundary (minimum and maximum) of each variable's ids. We denote by $B(R_i)$ the pair of the minimum id and maximum id of $\pi_1(T_i)$.
Some variables may occur more than once in query $Q$. Before uploading and after each join process, we update the bound $B(R_k)$. This pre-upload filter can reduce the data required to be transferred to the GPU memory.
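The bound bookkeeping can be sketched in a few lines of Python, using the numbers from the example above. The helper names `intersect_bounds` and `pre_upload_filter` are hypothetical, and the sketch assumes the candidate ids are already sorted, as they are when read from an index.

```python
from bisect import bisect_left, bisect_right

def intersect_bounds(b_i, b_j):
    """Intersect two inclusive (min_id, max_id) bounds; None if they do not overlap."""
    lo, hi = max(b_i[0], b_j[0]), min(b_i[1], b_j[1])
    return (lo, hi) if lo <= hi else None

def pre_upload_filter(sorted_ids, bound):
    """Return (lower, upper) offsets of the ids that fall inside the inclusive bound."""
    lo, hi = bound
    return bisect_left(sorted_ids, lo), bisect_right(sorted_ids, hi)

bound = intersect_bounds((1052, 2654), (1548, 3654))
ids = [1052, 1500, 1548, 2000, 2654, 3000]   # illustrative sorted id column
lower, upper = pre_upload_filter(ids, bound)
print(bound, ids[lower:upper])
```

Because the column is sorted, the surviving ids form one contiguous slice, so only that slice has to be copied to the GPU.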

Index swap operator
In some cases, for a given R i and R j , the first variable of both may not be the same, which prevents the join operation. The index swap operation is the operation for swap the variables from some other column to be in the first one and sort them afterwards. The purpose of this operator is to make it possible to join.
Let S(R i , ?x) be an index swap function that swaps variable ?x to be the first position in V i list and sort tuple T i with ?x column. Figure 10 is an example of swapping variable ?z in the tuples. Figure 10b shows resulting tuples after sorting by ?z based on Fig. 10a.
The cost of index swapping is the cost of sorting |T| tuples with |V| elements on the GPU. An index swap at an early stage may be time-consuming, while performing it at a later stage may be faster because fewer columns and tuples are involved. Note that the parallel sort in the Thrust library has complexity O(N log(N)/p); hence, an index swap costs O(|T| log(|T|)/p) for p threads.
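A sequential Python sketch of S(R, ?x) makes the two steps (column move, then sort) explicit. It assumes the same (variables, rows) layout as a tuple of variable names and a list of id tuples; the name index_swap is hypothetical, and the built-in sort stands in for the parallel GPU sort.

```python
def index_swap(result, var):
    """S(R, var): move `var` to the first column and re-sort the rows by it."""
    variables, rows = result
    c = variables.index(var)
    # New column order: the swapped variable first, the others unchanged.
    order = [c] + [k for k in range(len(variables)) if k != c]
    new_vars = tuple(variables[k] for k in order)
    # Tuples sort by the new first column; ties fall back to later columns.
    new_rows = sorted(tuple(row[k] for k in order) for row in rows)
    return new_vars, new_rows
```

After the swap, the result again satisfies the invariant that the first column is sorted, so it can be passed directly to a join on that variable.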
The star-shaped query is frequently found in SPARQL queries. It is a query pattern that has one node with high degree, and a single SPARQL query can contain several star-shaped patterns. This pattern spares us the index swap because one variable is used to join many tuples. On the other hand, the linear-shaped query forces us to use the index swap operator.
Fig. 10 Swap index by variable ?z (panels a and b)

Exploratory subquery
Consider a query Q = {sq 1 , sq 2 , ..., sq k } that has at least one subquery sq i = ⟨e 1 , e 2 , e 3 ⟩ where e 1 , e 2 and e 3 are all free variables. Such a subquery is called an exploratory subquery. It would require uploading all triples in the data set, since all three elements are free variables. However, with the help of the pre-upload filter we bound the id values, which reduces the number of tuples and therefore both the data transfer to the GPU memory and the index swapping time.
Figure 11 summarizes the overall activity of the VEDAS framework and describes the work done on both the host and GPU sides. The SPARQL query is first parsed into SV and SQ. Next, the query planner constructs the execution plan. For each subquery sq i , the system finds the proper index and uses the pre-upload filter to discard the out-of-range tuples before uploading the remainder to the GPU. The upload and join operators can be interleaved, which helps tighten the bounds of the free variables before each upload. Upon completion of all upload and join tasks, the final result is transferred back to the host memory. The final result contains only the ids of the related tuples; therefore, dictionary mapping and decoding steps are necessary to transform the triple-IDs back to their original forms.

Example
Consider the RDF data in Fig. 3. Figure 4 is based on the data D. At initialization, VEDAS creates a hash dictionary that transforms the terms to ids. Figure 6b shows the converted triple-IDs, and Fig. 6a shows the dictionary. The indexer sorts the converted triple-IDs to create 6 permuted indexed copies of the data, i.e., D POS , D PSO , D OPS , D OSP , D SPO , and D SOP . An example of sorted triple-IDs with the SOP index and the column-based data is shown in Fig. 7. All components and data in this initialization step are processed on the host side.
The data set D in the previous example is suitable for explanation, but it is too small to serve as a query example. From this point on, we assume that D is much larger than in the previous example.
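The construction of the six permuted indices can be sketched as follows. This is a minimal row-based illustration that assumes the triples are already encoded as (subject, predicate, object) id tuples; the function name build_indices is hypothetical, and the column-based storage used by VEDAS is not reproduced here.

```python
from itertools import permutations

def build_indices(triple_ids):
    """Return the six permuted, sorted copies of (s, p, o) id triples.

    Keys are index names such as "SPO" or "POS"; each value is the list of
    triples reordered to that column order and sorted lexicographically.
    """
    position = {"S": 0, "P": 1, "O": 2}
    indices = {}
    for perm in permutations("SPO"):
        cols = [position[c] for c in perm]
        indices["".join(perm)] = sorted(
            tuple(t[c] for c in cols) for t in triple_ids
        )
    return indices
```

A subquery with a bound predicate and a bound object, for instance, would read the "POS" copy, where all matching triples are contiguous and the free subject column is already sorted.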
The execution plan tree is shown in Fig. 12. Operations 1 and 2 use the D POS (or D OPS ) indices and upload them. The intermediate results R 1 and R 2 have only 1 free variable ?x (V 1 = V 2 = ⟨?x⟩). For operation 4, we can upload only D PSO , and the result R 4 has two variables (V 4 = ⟨?x, ?y⟩). Suppose the ?x ids of R 1 , R 2 and R 4 range over (1, 1347), (35, 1998) and (48, 1595), respectively. When uploading R 1 and R 2 to the GPU memory, the system can filter out every ?x that is not in the range (max(1, 35, 48), min(1347, 1998, 1595)) = (48, 1347). Notice that these ranges are known from the index data before uploading. The result R 5 has 2 columns, V 5 = ⟨?x, ?y⟩. To join with R 7 , where V 7 = ⟨?y⟩, ?y must be moved to the first column by operation 7. After swapping the index to ?y, we can join R 6 with R 7 and get the 2-column tuples that lead to the join in operation 8. The final result is returned to the host and decoded using the dictionary to recover the terms in their original form.
Figure 13 presents a numerical example of operations 1-8. Let Fig. 13a and c be the intermediate results obtained from operations 1 and 2, which correspond to sq 1 and sq 2 , respectively, before the pre-upload filter is applied. Figures 13b and d show the results after the pre-upload filter is applied with the bound (48, 1347): the tuples whose ?x id is less than 48 or greater than 1347 are eliminated and are not uploaded to the GPU memory. The intermediate result of R 1 ⊲⊳ R 2 is shown in Fig. 13e. It consists of the ids that are in both R 1 and R 2 . Similarly, Fig. 13f presents the tuples from the results of sq 3 , and Fig. 13g shows the reduced tuples after the pre-upload filter is applied with the bound (48, 934). In this step, the bound is updated due to R 3 . The intermediate result R 5 in Fig. 13h is obtained by joining R 3 with R 4 . The output has 2 columns: ?x and ?y.
Because R 5 cannot be joined directly with R 7 , which is indexed by the free variable ?y, the index must be swapped, resulting in R 6 in Fig. 13i. The bound is also updated to (89, 900). Instead of uploading R ′ 7 in Fig. 13j, the tuples are filtered to give R 7 in Fig. 13k, which is then uploaded. The final result, shown in Fig. 13l, is obtained from R 6 ⊲⊳ R 7 , i.e., by selecting only the matched ?y values and adding the corresponding ?x to the same tuple.

Experiments
We compare VEDAS with gStore [17] and RDF-3X [40], which are state-of-the-art open-source RDF stores. The storage size and query processing time are measured. The WatDiv [41] test suite and LUBM [42] are used as benchmarks. WatDiv is a SPARQL benchmark with different query structures and workload sizes. The generated queries fall into 4 categories: linear queries (L), star queries (S), snowflake-shaped queries (F) and complex queries (C).
The experiments are performed on a system with 64 CPUs (Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz) and 256 GB of memory. The system contains 4 NVIDIA Tesla V100 GPUs with 32 GB of memory each and runs CUDA 10.1. Our source code is implemented in C++ with the NVIDIA Thrust library. The Modern GPU library (sort-merge join) [39] is used for the joining process. We use only 1 GPU in the experiments.
We also compare two options of VEDAS: on-demand upload and pre-upload. The on-demand option uploads the triple data indicated by the subquery to the GPU memory based on the filter and selection method described in Section "VEDAS framework and operations"; that is, it uploads the triple-IDs only when needed. The pre-upload option uploads all the triple data (excluding indices) into the GPU memory in advance. When a part of the triple data is needed, the necessary rows are moved to another place in the GPU memory. This reduces the upload time at the cost of more GPU memory. The performance of both options is compared while considering the pre-upload filtering algorithm and the on-GPU-memory approach. Finally, we analyze the computing time and the data transfer, demonstrate the query cases where a speedup can be gained, and explain how the query planner can help improve the speedup.

Storage size
First, we compare the size of the data files that each framework generates. RDF-3X compresses all data into one file, while gStore creates many files for each dataset. In VEDAS, a .vdd file stores the dictionary data and a .vds file stores the triple data. We calculate the total size of all these files. Table 3 shows the storage used, in bytes, by each framework for the WatDiv datasets of 100M, 300M and 500M N-Triples. VEDAS requires less storage than RDF-3X and much less than gStore.

Query time
WatDiv has 20 queries categorized into 4 patterns: C, F, S, and L. For RDF-3X and gStore, the query time of each query is measured excluding the time to load the data into memory. Table 4 presents the query time for each query. Table 5 shows, for each workload and N-Triple size, the sum of the query times in each query class. Table 6 displays the speedup of the VEDAS query times (in both the on-demand and pre-upload cases) compared to those of RDF-3X and gStore.
NVIDIA nvprof is used to profile the proportions of computation time and upload time for the 300M data case. The values are reported in Table 7. We also summarize the size of the intermediate result of each operation (upload, join and index swap) to compare it with the proportions of computation time and upload time. Figures 14, 15 and 16 compare the query times of each query class for RDF-3X, gStore, and VEDAS. VEDAS gains more speedup, especially on class C. This is because the intermediate result after the join operation is large, and the GPU performs better in this case (see Table 7). Although almost all queries benefit from the GPU, some queries perform far better than on the other systems, e.g., C2, L2, L4 and L5. The query times and numbers of results for these queries are shown in Tables 4 and 8. For L2, L4, and L5, the computation time makes up a large proportion of the query time compared to the upload time. The uploaded data is small; therefore, these queries are suitable for processing on the GPU. C2 also has a high join-upload ratio (computation-transfer ratio), but not as high as L2, L4, and L5. This query involves heavy computation, i.e., it has 1M data items for joining and 0.7M for sorting. Consequently, it leads to significant computation time on the CPU. S1 is the type of query that VEDAS processes slowly. It has the lowest computation-transfer ratio on the GPU: it has 17.5M rows to upload and only 183 rows to join (in WatDiv300M). It is slightly slower than the CPU approach, which has a lower uploading cost. Overall, the GPU approach is superior for queries with many joins and a high computation-transfer ratio. Queries with large upload data and few joins yield only a small speedup. This leaves us an opportunity to optimize the upload time and reduce the size of intermediate results using a better query planner in the future.
Table 4 Query time of RDF-3X, gStore and VEDAS for each query in milliseconds
Table 5 Query time of RDF-3X, gStore and VEDAS for each query class in milliseconds
The results show that VEDAS with pre-upload gains more speedup than VEDAS on-demand. However, the performance of the on-demand case is still close to that of the pre-upload case. The on-demand approach is therefore practical, because it saves a lot of GPU memory with only a small increase in processing time. Adding the pre-upload filter can significantly improve performance.
LUBM is a synthetic benchmark like WatDiv. We generated a dataset of 1024 universities, which contains about 140 million triples. We consider 3 complex queries (L1, L2 and L7) and 1 simple query (L4) from [43] and [44]. L1 and L7 are queries containing cycles, and they require the join operation with 2 variables. These queries have a large intermediate result size but a small final result size. L4 is a star query with high selectivity. Finally, L2 contains two triples with low selectivity. The query times for each case and the numbers of resulting rows are shown in Table 9. The speedups of VEDAS compared to RDF-3X and gStore are shown in Table 10.
The VEDAS runtime for query L1 is notably faster than that of RDF-3X and gStore. The reason is that L1 is a high-computation query with a large intermediate-result (IR) join; therefore, it can take advantage of the GPU architecture. The join graph of query L7 has the same shape as that of L1. However, less data has to be joined and more data has to be uploaded. As a result, this case gains less speedup than L1. Query L2 is a simple join of two triples. The speedup is high because the join output has low selectivity and requires heavy computation. L4 has high selectivity and requires uploading triples many times, which makes GPU processing inefficient.
Table 11 shows the speedups of VEDAS and MAGiQ [23], both measured against RDF-3X. MAGiQ processes queries with a Matlab matrix library for GPU that is tuned and optimized for matrix processing. MAGiQ's experiment uses LUBM-10240 and different CPU and GPU specifications; therefore, we cannot compare the speedups directly. However, it can be seen that our speedups are better than or close to the MAGiQ results for the same benchmark.
Table 12 shows the percentage of time used by each operator for each query for the 500M data size case. About 50% of the queries spend at least 50% of their time uploading data, and the slow queries are caused by excessive upload time. Thus, to improve the system in the future, the uploading process can be optimized further. Suitable data can be selected for upload and kept resident in the GPU memory so that it can be reused several times. Another approach is to use pinned host memory or to increase the page size to reduce the transfer time.

Effect of data transfer
Table 12 also shows that the join processing time is significant. For some queries, e.g., S5 and S7, the join operations take more than 60% of the time while the upload time is insignificant. In other cases, the join takes the second-largest share of the time after the upload. Moreover, the join usually takes place after the filtering process. The more data that is filtered out, the less time the join takes, since the time of the join operation is determined by the size of the input relations and the size of the intermediate results.
Because the data is large, GPU memory allocation and copying are also time-consuming; they take about 3%-36% of the query time. Memory usage can be optimized by reusing a large pool of pre-allocated memory.

Encoding and decoding times
Since VEDAS represents triples using integer IDs to reduce the data transfer volume, it requires additional steps to encode the SPARQL query terms and to decode the resulting triples back into strings. We use C++'s open hashing, std::unordered_map, to implement the hash dictionaries for encoding and decoding. In our experiments, the encoding times are almost insignificant: the query terms can be encoded immediately using the hash dictionary with constant access time.
On the other hand, the decoding time depends on the query result size and the dictionary size. Decoding takes more time than encoding, but it is only a tiny fraction of the total query time. For example, it takes about 0.01 milliseconds to decode the results of the largest query C3 on WatDiv (500M). This acceptable time rests on the assumption that the dictionary is not larger than the main memory. For an enormously large dataset whose dictionary cannot reside in memory, a more complex data structure or index should be used to split the dictionary while keeping the access time small. Parallel processing for decoding is also possible when the result is large.
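The encoding and decoding steps can be illustrated with a small Python sketch in which a dict plays the role of std::unordered_map. The class name Dictionary and the policy of assigning consecutive ids in insertion order are assumptions made for the example; the id assignment scheme of VEDAS may differ.

```python
class Dictionary:
    """Bidirectional term <-> integer id mapping backed by hash tables."""

    def __init__(self):
        self.term_to_id = {}   # hash table used for encoding
        self.id_to_term = []   # ids are list positions, used for decoding

    def encode(self, term):
        """Return the id of `term`, assigning the next free id if it is new."""
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode_rows(self, rows):
        """Translate rows of ids back into rows of terms."""
        return [tuple(self.id_to_term[i] for i in row) for row in rows]
```

Encoding a query touches only the few constant terms it contains, whereas decoding visits every id in the result, which is why the decoding cost grows with the result size.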

Effect of query planner
In this section, we show that VEDAS has the potential to enhance performance by improving the query planner component. As explained in Subsection "Query planner", our planner takes the conventional approach: it starts from the leftmost subquery and joins in a left-deep fashion. The bounds of the variables are updated after each join operation to reduce the uploaded data. Since this plan orders the joins by the left-to-right triple order, it does not consider other orders that could yield a small number of results and filter out a lot of data before upload. Table 13 shows the order of operations for each plan type for query S5. U represents the upload operator, and the 2 numbers that follow are the row and column sizes of the uploaded data. J represents the join operator, shown with the sizes of the 2 intermediate results (IR) and the output size. The first column shows the case where all matching triples are uploaded to the GPU memory and then joined. The second column is the left-deep join plan with the left-to-right triple order. The third column is also a left-deep join plan, but it uploads the largest data first. The query processing times of these 3 plans are close to each other. The second plan, which is used in VEDAS, benefits a little from the pre-upload filter. The last column is the plan where we upload the 2 smallest data first, and their join results in 0 rows. Therefore, this is the fastest of the 4 plans. This example shows that if we apply cardinality estimation in the query planner to predict the size of the intermediate results of each join, we can select the best join order and reduce the data size of the subsequent related uploads.
Query C2 is one of WatDiv's most complex queries, having 10 triple types from all subqueries and 6 variables to join; hence it performs 10 upload operations and some index swap operations. This query has numerous possible query plans, and it is hard to find the optimal one. Table 14 shows the results of experiments with 3 query plans, together with the totals of their operations and their query times. The first plan (first column) separates the triple patterns into 2 groups, uses a left-deep join for each group and joins the 2 groups at the final stage. The second plan is similar but separates the patterns into 3 groups, each of which forms a star shape. After processing each group, the subquery results are joined together. The last plan is the complete left-deep join plan, which results in many index swap operations. In the experiments, the third plan is, as might be expected, the slowest because of its many index swap operations. The first plan uploads more data than the second, yet its total query time is shorter. This is because the second plan has more computation, which dominates the query time. A query planner that considers the GPU operation costs can help to select an efficient plan. Accurate CPU-GPU data transfer rates, latency and computation performance for each GPU model are required to improve the query plan.
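One simple way to exploit size estimates is a greedy smallest-first ordering, sketched below in Python. This is a heuristic illustration and not the VEDAS planner: it assumes that an estimated row count is available for every subquery, and the function name smallest_first_order and the tuple layout (name, variables, estimated_rows) are hypothetical.

```python
def smallest_first_order(subqueries):
    """Greedy left-deep order driven by estimated cardinalities.

    subqueries: non-empty list of (name, variables, estimated_rows).
    Start from the smallest subquery, then repeatedly add the smallest
    remaining subquery that shares a variable with those already joined.
    """
    remaining = sorted(subqueries, key=lambda sq: sq[2])
    first = remaining.pop(0)
    order, joined_vars = [first[0]], set(first[1])
    while remaining:
        # Prefer connected subqueries so that no Cartesian product is formed;
        # fall back to the smallest one if nothing is connected.
        pick = next((sq for sq in remaining if joined_vars & set(sq[1])),
                    remaining[0])
        remaining.remove(pick)
        order.append(pick[0])
        joined_vars |= set(pick[1])
    return order
```

Joining the small inputs first tends to tighten the variable bounds early, so the pre-upload filter can discard more rows of the larger inputs that follow.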

Extension to other operations
Our current implementation focuses on the basic SPARQL form: SELECT (project the selected variables) and WHERE (join). More advanced query forms can be implemented by following the guidelines below.

Optional
The OPTIONAL clause is equivalent to the left join operator. The simplest scheme for handling this clause is to first process all subqueries outside the OPTIONAL clause and then apply the left join operation. For example, in Listing 5, the first and second patterns (?person foaf:name ?name and ?person foaf:age 40) are joined using the inner join. After that, the results are joined with the pattern in the OPTIONAL clause (?person foaf:homepage ?page) using the left join.
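The left join can be sketched in Python as follows. For brevity the sketch uses a hash lookup on the first column rather than the sort-merge join, and it represents an unbound OPTIONAL variable with None; the function name left_join and its parameters are hypothetical.

```python
def left_join(left_rows, right_rows, right_width):
    """Left (outer) join of two row lists on their first column.

    Every left row is kept. Rows without a match are padded with None,
    which stands for an unbound OPTIONAL variable. right_width is the
    number of columns of right_rows (needed when right_rows is empty).
    """
    matches = {}
    for row in right_rows:
        matches.setdefault(row[0], []).append(row[1:])
    padding = (None,) * (right_width - 1)
    out = []
    for row in left_rows:
        for tail in matches.get(row[0], [padding]):
            out.append(row + tail)
    return out
```

Applied to the example, the left input would hold the (?person, ?name) rows produced by the inner join, and the right input the (?person, ?page) rows of the OPTIONAL pattern.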

Union
Disjunction, or union, is the operator that combines 2 intermediate result sets. The straightforward approach is to use a set union operation, which can be performed in parallel in O(n).
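A sequential Python sketch of the set union is given below. It assumes that both inputs have the same variable list and are sorted, so a single merge pass suffices; the function name sorted_union is hypothetical, and the duplicate check can be dropped if duplicates should be retained.

```python
from heapq import merge

def sorted_union(rows_a, rows_b):
    """Union of two sorted row lists in one pass, with duplicates removed."""
    out = []
    for row in merge(rows_a, rows_b):
        # Equal rows are adjacent after the merge, so comparing with the
        # last emitted row is enough to drop duplicates.
        if not out or out[-1] != row:
            out.append(row)
    return out
```

The output stays sorted by the first column, so it can take part in later joins without an additional sort.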

Filter
FILTER is perhaps the most challenging clause. It may contain complex expressions and high-level functions such as regex, substr, strlen, concat, etc. We can embed some information in the encoded integer id of a Triple-ID to support comparisons, e.g., reserving some bits to specify the datatype and ordering the remaining bits according to the order of the raw values. This scheme makes it possible to compare literal types and values. For a simple filter expression like FILTER(?year > 2018), the value 2018 is encoded into the same format as the ids of ?year and compared with the year values in the data store. We can thus bound the range of the filter variable's ids before uploading to the GPU memory.
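One possible bit layout for such an order-preserving id is sketched below in Python. The widths TYPE_BITS and VALUE_BITS, the tag TYPE_INTEGER and both function names are assumptions chosen for the example; the sketch handles only non-negative integer literals.

```python
TYPE_BITS = 4       # assumed number of high bits reserved for the datatype tag
VALUE_BITS = 28     # assumed number of low bits for the order-preserving value
TYPE_INTEGER = 1    # hypothetical tag for non-negative integer literals

def encode_literal(type_tag, value):
    """Pack a datatype tag and a non-negative value into one integer id."""
    assert 0 <= type_tag < (1 << TYPE_BITS)
    assert 0 <= value < (1 << VALUE_BITS)
    return (type_tag << VALUE_BITS) | value

def greater_than_bound(type_tag, value):
    """Inclusive id range matching FILTER(?v > value) for one datatype.

    The range is empty (lo > hi) when `value` is already the largest
    representable value.
    """
    lo = encode_literal(type_tag, value) + 1
    hi = encode_literal(type_tag, (1 << VALUE_BITS) - 1)
    return (lo, hi)
```

Because the tag occupies the high bits, ids of one datatype form a contiguous block in which id order equals value order, so the returned range can be used directly as a bound in the pre-upload filter.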
Listing 6 shows a SPARQL query with a FILTER clause. The example contains the logical operator and (&&), which can be handled by adding more constraints to the variable bound. Query rewriting may be applied to handle the or operator (||); for example, a FILTER with an or operator may be converted into a UNION clause.
High-level string functions require a mechanism to store the raw strings, or another string data structure, on which such functions can operate. We can construct a component that processes the string function and returns the resulting ids. The inner join can then be used to join the triple results with the filtered ids to obtain the final results.

Order by
One property of VEDAS is that it always maintains the ascending order of the first column. Assume that the order of all ids is the same as the order of the literals (by using the technique in Subsection "Filter"). If the variable in ORDER BY matches the first-column variable, the result list is obtained immediately for ASC; for DESC, the result list is reversed. For other order patterns, the GPU can perform a parallel sort directly before returning the result to the user. We can also take the ORDER BY variables into account in the query planner: if the planner arranges the operators so that the output matches the desired result order, the processing time can be reduced.
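The three ORDER BY cases can be summarized in a short Python sketch. It assumes the (variables, rows) layout with rows sorted ascending by the first column and ids whose order matches the literal order; the function name order_by is hypothetical, and the built-in sort stands in for the parallel GPU sort.

```python
def order_by(result, var, descending=False):
    """Return the rows ordered by `var`, reusing the existing order if possible."""
    variables, rows = result
    c = variables.index(var)
    if c == 0:
        # Rows are already ascending on the first column: copy for ASC,
        # reverse for DESC, no sort needed.
        return rows[::-1] if descending else list(rows)
    # Any other column requires a real sort.
    return sorted(rows, key=lambda row: row[c], reverse=descending)
```

Only the last branch costs a sort, which is why a planner that leaves the ORDER BY variable in the first column can save time.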

Conclusion and future work
RDF query processing involves processing large triple data, which can be time-consuming. This paper demonstrates how to handle SPARQL queries using the thousands of threads in the GPU. A suitable data representation must be chosen to compact the data and reduce the data transfer between the GPU and the CPU while using the parallel threads effectively. We introduce a compact representation for storing the triple data in both host and GPU memory, and we propose a framework for querying the triple data with SPARQL on the GPU. The triple data are converted into indexed column-based data called Triple-IDs. The triple data are stored in the host main memory and are uploaded to the GPU when the query processing requires them. The pre-upload filter is designed to reduce the data size and thereby minimize the transfer time. The uploaded data can be accessed quickly through the indices. The index swap operation is introduced to enable GPU sorting and merge join. The query plan that orders the combination of upload, join and index swap operations can then be created.
The experiments show that our approach achieves a speedup of 1.95 to 15.82 compared to RDF-3X and 2.76 to 397.03 compared to gStore. It is also shown that the on-demand upload and pre-upload approaches yield similar execution times. Thus, on-demand upload may be a good choice. The timing results show the implications of using our GPU-based approach to improve the query processing time. The analysis demonstrates the query types that can gain advantages from our framework.
There are many ways to further improve the performance: (1) To overcome the GPU memory limitation and scale the processing power, the extension to multiple GPUs is an attractive solution. (2) Planning the operator order and parallelizing the operator tasks can also increase the efficiency. (3) In the pre-upload filter process, the query becomes faster as more triples are eliminated; a hashing function that maximizes the number of filtered tuples may be considered. (4) A new join algorithm for this new representation is another very interesting problem.

Hardware acceleration
Nowadays, multiprocessor architectures are popular and new types of accelerators are emerging. Our work here focuses only on a single NVIDIA GPU, but the technique and representation can be generalized to other accelerator types as well. The combination of the integer ID format, indices with permutation orders and the pre-upload filtering technique can also be applied to FPGAs, TPUs or vector processors, although the join operation should be optimized for each accelerator.

Application to Big Data databases
In addition to the hardware acceleration trend, database researchers are also paying attention to big data databases, which deal with very large structured and unstructured datasets. Our current work focuses on large RDF datasets, which are semi-structured data. The approach is adaptable to large-scale databases. For example, the source database may be stored in a distributed manner, e.g., on the Hadoop Distributed File System (HDFS). Our preprocessing can easily be modified to convert the data into Triple-IDs in parallel on HDFS, and the converted representation can be stored there. The querying process can proceed as usual after obtaining the converted data from the file system. We plan to scale our work to multiple GPUs and cluster machines. The parallelization of the querying process will be divided between the cluster level and the multiple GPUs, and the querying process will be modified to accommodate multiple GPUs and distributed querying.