1 Introduction

The last decade has seen an impressive growth of the Linked Open Data (LOD) community, which promotes the use of the Resource Description Framework (RDF) to publicly share semi-structured data on the Web and to connect different data items by reusing HTTP Internationalized Resource Identifiers (IRIs) across data sources [3]. Besides HTTP access to RDF data, publishers also provide RDF dataset dumps (for download), and query endpoints that expose various capabilities, ranging from basic queries in RESTful APIs, such as Linked Data Fragments (LDF) [22], to SQL-like structured queries using SPARQL [9].

Although the LOD paradigm promises access to a huge distributed knowledge base that can be browsed and queried online, efficient web-scale consumption of LOD has proven problematic in practice. Consider, for example, retrieving all entities with the label “Tim Berners-Lee”, which can be formulated in SPARQL as follows: select distinct ?x { ?x owl:sameAs*/rdfs:label "Tim Berners-Lee" }. Given the distributed nature of Linked Open Data, resolving this simple query requires one of the following approaches:

  • Download, index and query datasets locally. This approach is costly for the data consumer, who is likely to run into scalability issues.

  • Run a federated query against all known sources [5]. This approach is as good as the query endpoints that it relies on. Unfortunately, SPARQL endpoints are known to have low availability [7, 21], and federated queries are difficult to optimize beyond a limited number of sources [17].

  • Browse online sources in a follow-your-nose way [11]. This requires on-the-fly traversal of the globally distributed RDF graph. In practice, many IRIs do not dereference, and since our particular query does not contain an IRI at all (only a literal), it is not clear where graph traversal should start.

Thus, the three main approaches for querying LOD all have significant drawbacks, making it infeasible to evaluate even simple queries on the Semantic Web [12]. Some of these issues are partially solved by services like Datahub and LOD Laundromat [2], which provide central catalogs for discovering and accessing cached versions of Linked Open Datasets. However, data consumers still need to navigate and process large corpora, consisting of thousands of dumps or endpoints, in order to evaluate queries or conduct large-scale experiments.

In this paper, we propose the LOD-a-lot dataset which offers low-cost consumption of a large portion of the LOD Cloud. We integrate 650 K datasets that are crawled by LOD Laundromat [2] into a single, self-indexed HDT [8] file. This HDT file is conveniently small and can be directly queried by data consumers with a limited memory footprint. LOD-a-lot contains 28 billion unique triples and, to the best of our knowledge, is the first approach to provide an indexed and ready-to-consume crawl of a large portion of the LOD Cloud that can be used offline. In addition, an online LDF interface to LOD-a-lot is provided.

The paper is organized as follows. Section 2 presents LOD-a-lot and its main benefits. Section 3 describes the available interfaces and tools to work with LOD-a-lot. We summarize LOD-a-lot statistics in Sect. 4, and describe potential use cases for it in Sect. 5. Section 6 concludes and outlines future work.

2 LOD-a-lot: Concepts and Benefits

LOD-a-lot proposes an effective way of packaging a standards-compliant subset of the LOD Cloud into a ready-to-use file format.

LOD Laundromat [2] is a service that crawls, cleans and republishes Linked Open Datasets from Open Data portals like Datahub. As illustrated in Fig. 1, each dataset is cleaned to improve data quality: (i) syntax errors are detected and heuristics are used to recover from them; (ii) duplicate statements within datasets are removed; (iii) Skolemization is performed to replace blank nodes with well-known IRIs; and (iv) the cleaned dataset is lexicographically sorted. The current version (May 2015) is composed of 657,902 datasets and contains over 38 billion triples (including between-dataset duplicates). For each dataset a gzipped Canonical N-Triples file, an HDT file, and an LDF [22] endpoint are published.
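
To make the Skolemization step concrete, the following Python sketch replaces blank nodes with well-known IRIs using rdflib; the base IRI and input file name are illustrative assumptions, not the ones used by LOD Laundromat:

  # Minimal Skolemization sketch: replace each blank node with a deterministic
  # IRI under a .well-known/genid/ namespace. Base IRI and file are assumptions.
  from rdflib import Graph, URIRef, BNode

  BASE = "http://example.org/.well-known/genid/"  # assumed base IRI

  def skolemize(graph):
      out = Graph()
      for s, p, o in graph:
          s = URIRef(BASE + str(s)) if isinstance(s, BNode) else s
          o = URIRef(BASE + str(o)) if isinstance(o, BNode) else o
          out.add((s, p, o))
      return out

  g = Graph().parse("dataset.nt", format="nt")  # illustrative input file
  print(skolemize(g).serialize(format="nt"))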

Header-Dictionary-Triples (HDT) [8] is a binary compression format and – at the same time – a self-contained and queryable data store for RDF. HDT represents its main components (Dictionary and Triples) with compact data structures that enable storing, parsing and loading Big Semantic Data in compressed space. HDT data are indexed by subject, and can therefore be used to efficiently resolve subject-bounded Triple Pattern (TP) queries as well as fully unbounded queries [8]. HDT-Focused on Querying (HDT-FoQ) [15] extends HDT with two indexes (enabling predicate- and object-based access, respectively) that can be created by the HDT consumer in order to speed up all TP queries. HDT can be used as a storage backend for large-scale graph data that achieves competitive query performance [15].
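
As a sketch of how a data consumer can resolve triple patterns directly against an HDT file, the following Python snippet uses the pyHDT bindings (the hdt package); the file name is an illustrative assumption:

  # Triple pattern (TP) resolution over an HDT file with the pyHDT bindings
  # ("pip install hdt"). The file name is an illustrative assumption.
  from hdt import HDTDocument

  doc = HDTDocument("lod-a-lot.hdt")  # memory-maps the HDT file

  # Fully unbounded pattern (?s, ?p, ?o): empty strings act as wildcards.
  triples, cardinality = doc.search_triples("", "", "")
  print("estimated matches:", cardinality)

  # Subject-bounded pattern: all triples about a given IRI.
  triples, _ = doc.search_triples(
      "http://dbpedia.org/resource/Tim_Berners-Lee", "", "")
  for s, p, o in triples:
      print(s, p, o)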

Linked Data Fragments (LDF) [22] aims to improve the scalability and availability of SPARQL endpoints by minimizing server-side processing and moving intelligence to the client. LDF allows simple Triple Patterns to be queried, where results are retrieved incrementally through pagination. Each of these pages (referred to as fragments) includes an estimate of the total number of results and hypermedia controls (using the Hydra Vocabulary [14]), such that clients can perform query planning, retrieve all fragments, and join sub-query results locally. As such, server load is minimized and large data collections can be exposed with high availability. Given that HDT provides fast, low-cost TP resolution, LDF is traditionally used in combination with HDT.
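
To illustrate how an LDF client interacts with such a server, the sketch below requests a single fragment page over HTTP; the endpoint URL is a placeholder, and the parameter names follow the Triple Pattern Fragments convention:

  # Fetch one Triple Pattern Fragment page over HTTP. The endpoint URL is a
  # placeholder; subject/predicate/object parameters follow the TPF convention.
  import requests

  ENDPOINT = "http://example.org/lod-a-lot"  # assumed TPF endpoint URL
  pattern = {
      "subject": "",
      "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
      "object": '"Tim Berners-Lee"',
  }
  response = requests.get(ENDPOINT, params=pattern,
                          headers={"Accept": "text/turtle"})
  # The fragment contains the matching triples, a result count estimate and
  # Hydra hypermedia controls for pagination.
  print(response.text)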

Fig. 1. LOD-a-lot overview and data flow.

In spite of the inherent benefits of LOD Laundromat for conducting large-scale experiments, consumers still need to access each dataset or endpoint independently over HTTP, which results in additional overhead when analyzing the corpus as a whole. LOD-a-lot tackles this issue and provides a unified view of all the data crawled and cleaned by LOD Laundromat as one big knowledge graph. To do so, we carefully integrate the 650 K HDT datasets into a single HDT file. To improve the scalability of this process, we perform parallel and incremental merges of HDT files, integrating their Dictionary and Triples components. In addition to the HDT file, we also create and expose the HDT-FoQ index. The resulting HDT file is offered for download for local use and is exposed through an LDF endpoint for online use (Fig. 1).

The resultant LOD-a-lot dataset has the following properties:

  • Standards-compliance. The LOD Laundromat cleaning process and the HDT conversion guarantee that the indexed data is standards-compliant [2].

  • Volume & Variety. LOD-a-lot consists of over 28 billion triples (one of the largest single RDF datasets to date) and merges more than 650 K datasets, which cover a large subset of the topic domains in LOD.

  • Accessibility. The combination of HDT and LDF in LOD-a-lot allows users to perform structured queries through a uniform access point that is standards-compliant and self-descriptive through Hydra [14].

  • Scalability & Availability. Most LOD query endpoints either expose a small dataset, have low availability, or are too expensive to maintain. LOD-a-lot alleviates these problems for online and offline data consumption: HDT is highly compressed and can resolve triple pattern queries directly on the compressed data, with a limited memory footprint (in practice, 3% of the total dataset size). In turn, LDF deploys this functionality online and minimizes the server burden, pushing the composition of more complex queries to the client.

  • Ease of (re)use. Because LOD-a-lot is just one file, it can be downloaded, copied, or linked to easily.

  • Cost-effectiveness. Due to the HDT compression technique, the hardware footprint of LOD-a-lot is relatively small, requiring 524 GB of (solid-state) disk space and (when queried) 15.7 GB of RAM. At the time of writing the combined cost of these two hardware resources is approximately 305 euros.

3 Availability and Sustainability

LOD-a-lot is available at http://purl.org/HDT/lod-a-lot and listed in the datahub.io catalog, where we provide the following access methods to the dataset:

  • HDT Dump + HDT-FoQ index, released under the ODC PDDL license.

  • LDF interface, to serve online SPARQL resolution using LDF clients.

  • VoID description of the dataset to aid automatic discovery services.

Because LOD-a-lot integrates 650 K+ datasets into one integrated RDF graph, it does not store the locations from which particular statements originate. This provenance information can be retrieved from LOD Laundromat, which stores the original source location, crawling metadata, and dataset metrics [20].

The sustainability of LOD-a-lot is supported by the joint effort of the LOD Laundromat and HDT projects. These projects, together with LDF, have been running for the last 3–6 years and are now well-established. We are creating an update policy for LOD-a-lot, to run in tandem with new LOD Laundromat crawls. The LOD-a-lot file can be used with a wealth of available HDT tools, including libraries for C++, Java, Node.js and SWI-Prolog. HDT tools are easily deployed using Docker, and integrations with other open source projects (Apache Jena, Tinkerpop) exist.

The canonical citation for LOD-a-lot is “Fernández, J. D., Beek, W., Martínez-Prieto, M. A., and Arias, M. LOD-a-lot: A Queryable Dump of the LOD Cloud (2017). http://purl.org/HDT/lod-a-lot”.

4 LOD-a-lot Statistics Summary

A simple analysis of LOD-a-lot reports some interesting statistics. Table 1 reports the number of unique triples and of distinct subjects, predicates, and objects in our dataset. The two rightmost columns also report the number of common subjects and objects, i.e. those terms playing both roles in the dataset, and the total number of literal objects. The results are in line with the widespread perception that the number of predicates is very limited w.r.t. the number of triples (in this case, about 1M distinct predicates for 28B triples, i.e. less than 0.004%) due to vocabulary reuse. A more detailed analysis (Fig. 2, middle) shows that predicates follow a power-law distribution, where a long tail of predicates is barely used while a limited set of predicates appears in a great number of triples.

Interestingly, almost the same number of subjects and objects (3B terms each) are used in LOD-a-lot. The high ratio w.r.t. the number of triples (11%) indicates low reuse of such terms. Figure 2 further elaborates on this and depicts the subject (left) and object (right) distributions. Both follow power laws, but objects exhibit a longer tail, with some objects repeated massively (up to 1B times). Finally, note two interesting numbers for understanding the underlying dataset structure: (i) around 40% of subjects and objects play both roles, which means that it is easy to find chain paths of at least two connected triples; and (ii) more than 1.3B objects are literals, so 41% of object nodes have no outgoing links.
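
The top-level counts in Table 1 can be read directly from the HDT Header and Dictionary components, without scanning the triples. A sketch with the pyHDT bindings follows; the file name is illustrative and the property names are assumed from the pyHDT documentation, so treat them as assumptions if your bindings differ:

  # Summary counts straight from the HDT Header/Dictionary (no triple scan).
  # File name is illustrative; property names are assumed from the pyHDT docs.
  from hdt import HDTDocument

  doc = HDTDocument("lod-a-lot.hdt")
  print("triples:    ", doc.total_triples)
  print("subjects:   ", doc.nb_subjects)
  print("predicates: ", doc.nb_predicates)
  print("objects:    ", doc.nb_objects)
  print("shared s/o: ", doc.nb_shared)  # terms used both as subject and object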

Table 1. LOD-a-lot summary statistics.
Fig. 2. Distribution of subjects, predicates, and objects in LOD-a-lot (log-log scale).

A space analysis shows that the LOD-a-lot HDT dump encodes 28B triples in 304 GB: 133 GB are used for the compressed Dictionary, and 171 GB for the Triples component. HDT-FoQ indexes are also built to speed up all TP queries over the queryable dump; these additional structures take 220 GB.

Finally, we performed a deployment test (using the HDT-C++ library) on a modest computer, resulting in a load time of only 144 s and a memory footprint of 15.7 GB of RAM (≈3% of the total dataset size). Furthermore, LDF queries (with a page size of 100 results) are resolved in milliseconds. This shows that LOD-a-lot makes managing and querying 28B triples affordable.
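
A comparable (if rougher) measurement can be reproduced with a few lines of Python over the pyHDT bindings; the file name is illustrative and the memory reading is Unix-specific:

  # Rough reproduction of the deployment test: measure load time, resident
  # memory and a warm TP lookup. File name is illustrative; Unix-only.
  import time, resource
  from hdt import HDTDocument

  start = time.time()
  doc = HDTDocument("lod-a-lot.hdt")  # loads the HDT file and its side index
  print("load time: %.1f s" % (time.time() - start))

  rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux
  print("resident memory: %.1f GB" % (rss_kb / 1024 ** 2))

  start = time.time()
  _, card = doc.search_triples("", "http://www.w3.org/2000/01/rdf-schema#label", "")
  print("TP cardinality %d estimated in %.3f ms" % (card, (time.time() - start) * 1e3))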

5 Relevance of the Dataset

This section describes three focused use cases for LOD-a-lot.

Query resolution at Web scale (UC1) is still an open challenge. Besides the aforementioned drawbacks of query federation [17, 18] and follow-your-nose traversal querying [12], pioneering centralized approaches, such as Sindice [19], have already been discontinued. OpenLink Software's LOD Cloud Cache maintains a SPARQL endpoint over a portion of the LOD Cloud, but it only reports 4B triples, the system suffers from the traditional size/time restrictions of SPARQL endpoints, and simple unbounded queries (e.g. the query in Sect. 1) incur timeouts. LOD-a-lot promotes query resolution at Web scale not only by actually providing such a service for the indexed 28B triples, but also by showing the feasibility, scalability and efficiency of a centralized approach based on HDT and LDF.
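
As an illustration, the introductory query (entities reaching a resource labelled “Tim Berners-Lee” via zero or more owl:sameAs links) can be resolved client-side against the LOD-a-lot HDT file by combining triple pattern lookups. The following is a hedged sketch with the pyHDT bindings; the file name is illustrative and literal matching is simplified to a single lexical form:

  # Client-side resolution of: select distinct ?x
  #   { ?x owl:sameAs*/rdfs:label "Tim Berners-Lee" }
  # Sketch only: file name is illustrative; language tags/datatypes ignored.
  from hdt import HDTDocument

  RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
  OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

  doc = HDTDocument("lod-a-lot.hdt")

  # Step 1: entities that directly carry the label (zero-length sameAs path).
  labelled, _ = doc.search_triples("", RDFS_LABEL, '"Tim Berners-Lee"')
  results = {s for s, _, _ in labelled}

  # Step 2: expand owl:sameAs backwards until a fixpoint is reached.
  frontier = set(results)
  while frontier:
      nxt = set()
      for y in frontier:
          same, _ = doc.search_triples("", OWL_SAMEAS, y)
          nxt.update(s for s, _, _ in same)
      frontier = nxt - results
      results |= frontier

  print(len(results), "matching entities")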

Evaluation and benchmarking (UC2) have increasingly gained attention in the Semantic Web community [4]. However, Semantic Web evaluations still lack volume and variety. The Billion Triple Challenge (BTC) [13] and the WebDataCommons Microdata, RDFa and Microformats dataset series [16] assist in this context by crawling RDF data from the Web and providing a single integrated dataset. However, BTC is limited to 4B triples and uses a minimal sample of each crawled data source, which provides an incomplete view of the data. In turn, the WebDataCommons dataset scales in size (44B triples), but its focus on Microdata means that its variety and general applicability are limited in practice. LOD Laundromat addresses this issue and republishes heterogeneous RDF datasets, but these have to be managed independently, which can be a pain point for consumers. LOD-a-lot thus integrates the main advantages of all these proposals in terms of size (28B triples), variety (650K datasets) and single access point. LOD-a-lot is extremely easy and efficient to deploy in a local environment (via HDT), which allows Semantic Web academics and practitioners to run experiments over the largest and most heterogeneous indexed, ready-to-consume RDF dataset.

RDF metrics and analytics (UC3) are widely used in SPARQL query optimization techniques [10] in order to find the optimal query plan. However, few studies inspect structural properties of real-world RDF data at Web scale [6], and even those only involve a few million triples. More recently, the potential of LOD Laundromat has been exploited to characterize the quality of the data [1]. The LOD-a-lot characteristics (see Sect. 2) democratize the computation of RDF metrics and analytics at Web scale (see the degree distributions in Fig. 2 as a practical example). Furthermore, particular metrics can take advantage of the HDT components in isolation, e.g. computing the average length of IRIs and literals only requires scanning the Dictionary (which collects all terms), whereas computing the in-degrees of objects only requires accessing the Triples component (which indexes the graph).
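
For instance, the in-degree of a given node can be read from the cardinality of an object-bounded triple pattern without materializing any results; a small sketch with the pyHDT bindings (file name and example IRI are illustrative):

  # Object in-degree from a single object-bounded TP lookup: the cardinality
  # is answered from the Triples indexes, no results need to be materialized.
  # File name and example IRI are illustrative.
  from hdt import HDTDocument

  doc = HDTDocument("lod-a-lot.hdt")

  def in_degree(node_iri):
      _, cardinality = doc.search_triples("", "", node_iri)
      return cardinality

  print(in_degree("http://xmlns.com/foaf/0.1/Person"))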

In addition, we envision further practical applications for entity linking and data enrichment (e.g. leveraging in-links and owl:sameAs-related entities), ranking of entities and vocabularies (e.g. analyzing their use), data summarization and other data mining techniques (e.g. finding commonalities in the data).

6 Conclusions and Future Work

The steady adoption of Linked Open Data (LOD) in recent years has led to a significant increase in the number and volume of RDF datasets. Today, problems such as data discovery and structured querying at web scale remain open challenges given the distributed nature of LOD.

This paper has presented LOD-a-lot, a simple and cost-effective way to query and study a large copy of the LOD Cloud. LOD-a-lot collects all data gathered by the LOD Laundromat service and exposes it as a single HDT file, which can be queried online for free, and which can be downloaded and queried locally over commodity hardware. Requiring 524 GB of disk space and 15.7 GB of RAM, LOD-a-lot allows more than 28 billion unique triples to be queried using hardware costing – at the time of writing – 305 euro.

We plan to update LOD-a-lot regularly and include further datasets from the LOD Cloud. We are also working on a novel HDT variation to index quad information and thus keep track of the input sources contributing to LOD-a-lot. Altogether, we expect LOD-a-lot to democratize access to LOD and to become a reference for low-cost Web-scale evaluations.