1 Introduction

The last decade has seen an impressive growth of the Linked Open Data (LOD) community, which promotes the use of the Resource Description Framework (RDF) to publicly share semi-structured data on the Web and to connect different data items by reusing HTTP Internationalized Resource Identifiers (IRIs) across data sources [3]. Besides HTTP access to RDF data, publishers also provide RDF dataset dumps (for download), and query endpoints that expose various capabilities, ranging from basic queries in RESTful APIs, such as Linked Data Fragments (LDF) [22], to SQL-like structured queries using SPARQL [9].

Although the LOD paradigm promises access to a huge distributed knowledge base that can be browsed and queried online, efficient web-scale consumption of LOD has proven problematic in practice. Consider, for example, retrieving all entities with the label “Tim Berners-Lee”, which can be formulated in SPARQL as follows: select distinct ?x { ?x owl:sameAs*/rdfs:label "Tim Berners-Lee" }. Given the distributed nature of Linked Open Data, resolving this simple query requires one of the following approaches:

  • Download, index and query datasets locally. This approach is costly for the data consumer, who is likely to run into scalability issues.

  • Run a federated query against all known sources [5]. This approach is as good as the query endpoints that it relies on. Unfortunately, SPARQL endpoints are known to have low availability [7, 21], and federated queries are difficult to optimize beyond a limited number of sources [17].

  • Browse online sources in a follow-your-nose way [11]. This requires on-the-fly traversal of the globally distributed RDF graph. In practice, many IRIs do not dereference, and since our particular query does not contain an IRI at all (only a literal), it is not clear where graph traversal should start.

Thus, the three main approaches for querying LOD all have significant drawbacks, making it infeasible to evaluate even simple queries on the Semantic Web [12]. Some of these issues are partially solved by services like Datahub and LOD Laundromat [2], which provide central catalogs for discovering and accessing cached versions of Linked Open Datasets. However, data consumers still need to navigate and process large corpora, consisting of thousands of dumps or endpoints, in order to evaluate queries or conduct large-scale experiments.

In this paper, we propose the LOD-a-lot dataset which offers low-cost consumption of a large portion of the LOD Cloud. We integrate 650 K datasets that are crawled by LOD Laundromat [2] into a single, self-indexed HDT [8] file. This HDT file is conveniently small and can be directly queried by data consumers with a limited memory footprint. LOD-a-lot contains 28 billion unique triples and, to the best of our knowledge, is the first approach to provide an indexed and ready-to-consume crawl of a large portion of the LOD Cloud that can be used offline. In addition, an online LDF interface to LOD-a-lot is provided.

The paper is organized as follows. Section 2 presents LOD-a-lot and its main benefits. Section 3 describes the available interfaces and tools to work with LOD-a-lot. We summarize LOD-a-lot statistics in Sect. 4, and describe potential use cases for it in Sect. 5. Section 6 concludes and outlines future work.

2 LOD-a-lot: Concepts and Benefits

LOD-a-lot proposes an effective way of packaging a standards-compliant subset of the LOD Cloud into a ready-to-use file format.

LOD Laundromat [2] is a service that crawls, cleans and republishes Linked Open Datasets from Open Data portals like Datahub. As illustrated in Fig. 1, each dataset is cleaned to improve data quality: (i) syntax errors are detected and heuristics are used to recover from them; (ii) duplicate statements within datasets are removed; (iii) Skolemization is performed to replace blank nodes with well-known IRIs; and (iv) the cleaned dataset is lexicographically sorted. The current version (May 2015) is composed of 657,902 datasets and contains over 38 billion triples (including between-dataset duplicates). For each dataset a gzipped Canonical N-Triples file, an HDT file, and an LDF [22] endpoint are published.
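
To make the Skolemization step concrete, the following Python sketch replaces blank nodes with well-known IRIs using rdflib; the base IRI and input file name are illustrative assumptions, not the ones used by LOD Laundromat:

  # Minimal Skolemization sketch: replace each blank node with a deterministic
  # IRI under a .well-known/genid/ namespace. Base IRI and file are assumptions.
  from rdflib import Graph, URIRef, BNode

  BASE = "http://example.org/.well-known/genid/"  # assumed base IRI

  def skolemize(graph):
      out = Graph()
      for s, p, o in graph:
          s = URIRef(BASE + str(s)) if isinstance(s, BNode) else s
          o = URIRef(BASE + str(o)) if isinstance(o, BNode) else o
          out.add((s, p, o))
      return out

  g = Graph().parse("dataset.nt", format="nt")  # illustrative input file
  print(skolemize(g).serialize(format="nt"))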

Header-Dictionary-Triples (HDT) [8] is a binary compression format and – at the same time – a self-contained and queryable data store for RDF. HDT represents its main components (Dictionary and Triples) with compact data structures that enable storing, parsing and loading Big Semantic Data in compressed space. HDT data are indexed by subject, and can therefore be used to efficiently resolve subject-bounded Triple Pattern (TP) queries as well as fully unbounded queries [8]. HDT-Focused on Querying (HDT-FoQ) [15] extends HDT with two indexes (enabling predicate- and object-based access, respectively) that can be created by the HDT consumer in order to speed up all TP queries. HDT can be used as a storage backend for large-scale graph data that achieves competitive query performance [15].
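
As a sketch of how a data consumer can resolve triple patterns directly against an HDT file, the following Python snippet uses the pyHDT bindings (the hdt package); the file name is an illustrative assumption:

  # Triple pattern (TP) resolution over an HDT file with the pyHDT bindings
  # ("pip install hdt"). The file name is an illustrative assumption.
  from hdt import HDTDocument

  doc = HDTDocument("lod-a-lot.hdt")  # memory-maps the HDT file

  # Fully unbounded pattern (?s, ?p, ?o): empty strings act as wildcards.
  triples, cardinality = doc.search_triples("", "", "")
  print("estimated matches:", cardinality)

  # Subject-bounded pattern: all triples about a given IRI.
  triples, _ = doc.search_triples(
      "http://dbpedia.org/resource/Tim_Berners-Lee", "", "")
  for s, p, o in triples:
      print(s, p, o)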

Linked Data Fragments (LDF) [22] aims to improve the scalability and availability of SPARQL endpoints by minimizing server-side processing and moving intelligence to the client. LDF allows simple Triple Patterns to be queried, where results are retrieved incrementally through pagination. Each of these pages (referred to as fragments) includes an estimate of the total number of results and hypermedia controls (using the Hydra Vocabulary [14]), such that clients can perform query planning, retrieve all fragments, and join sub-query results locally. As such, server load is minimized and large data collections can be exposed with high availability. Given that HDT provides fast, low-cost TP resolution, LDF is traditionally used in combination with HDT.
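
To illustrate how an LDF client interacts with such a server, the sketch below requests a single fragment page over HTTP; the endpoint URL is a placeholder, and the parameter names follow the Triple Pattern Fragments convention:

  # Fetch one Triple Pattern Fragment page over HTTP. The endpoint URL is a
  # placeholder; subject/predicate/object parameters follow the TPF convention.
  import requests

  ENDPOINT = "http://example.org/lod-a-lot"  # assumed TPF endpoint URL
  pattern = {
      "subject": "",
      "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
      "object": '"Tim Berners-Lee"',
  }
  response = requests.get(ENDPOINT, params=pattern,
                          headers={"Accept": "text/turtle"})
  # The fragment contains the matching triples, a result count estimate and
  # Hydra hypermedia controls for pagination.
  print(response.text)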

Fig. 1. LOD-a-lot overview and data flow.

In spite of the inherent benefits of LOD Laundromat for conducting large-scale experiments, consumers still need to access each dataset or endpoint independently over HTTP, which results in additional overhead when analyzing the corpus as a whole. LOD-a-lot tackles this issue and provides a unified view of all the data crawled and cleaned by LOD Laundromat as one big knowledge graph. To do so, we carefully integrate the 650 K HDT datasets into a single HDT file. To improve the scalability of this process, we perform parallel and incremental merges of HDT files, integrating their Dictionary and Triples components. In addition to the HDT file, we also create and expose the HDT-FoQ index. The resulting HDT file is offered for download for local use and is exposed through an LDF endpoint for online use (Fig. 1).

The resultant LOD-a-lot dataset has the following properties:

  • Standards-compliance. The LOD Laundromat cleaning process and the HDT conversion guarantee that the indexed data is standards-compliant [2].

  • Volume & Variety. LOD-a-lot consists of over 28 billion triples (one of the largest single RDF datasets to date) and merges more than 650 K datasets, which cover a large subset of the topic domains in LOD.

  • Accessibility. The combination of HDT and LDF in LOD-a-lot allows users to perform structured queries through a uniform access point that is standards-compliant and self-descriptive through Hydra [14].

  • Scalability & Availability. Most LOD query endpoints either expose a small dataset, have low availability, or are too expensive to maintain. LOD-a-lot alleviates these problems for online and offline data consumption: HDT is highly compressed and can resolve triple pattern queries directly on the compressed data, with a limited memory footprint (in practice, 3% of the total dataset size). In turn, LDF deploys this functionality online and minimizes the server burden, pushing the composition of more complex queries to the client.

  • Ease of (re)use. Because LOD-a-lot is just one file, it can be downloaded, copied, or linked to easily.

  • Cost-effectiveness. Due to the HDT compression technique, the hardware footprint of LOD-a-lot is relatively small, requiring 524 GB of (solid-state) disk space and (when queried) 15.7 GB of RAM. At the time of writing the combined cost of these two hardware resources is approximately 305 euros.

3 Availability and Sustainability

LOD-a-lot is available at http://purl.org/HDT/lod-a-lot and listed in the datahub.io catalog, where we provide the following access methods to the dataset:

  • HDT Dump + HDT-FoQ index, released under the ODC PDDL license.

  • LDF interface, to serve online SPARQL resolution using LDF clients.

  • VoID description of the dataset to aid automatic discovery services.

Because LOD-a-lot integrates 650 K+ datasets into one integrated RDF graph, it does not store the locations from which particular statements originate. This provenance information can be retrieved from LOD Laundromat, which stores the original source location, crawling metadata, and dataset metrics [20].

The sustainability of LOD-a-lot is supported by the joint effort of the LOD Laundromat and HDT projects. These projects, together with LDF, have been running for the last 3–6 years and are now well-established. We are creating an update policy for LOD-a-lot, to run in tandem with new LOD Laundromat crawls. The LOD-a-lot file can be used with a wealth of available HDT tools, including libraries for C++, Java, Node.js and SWI-Prolog. HDT tools are easily deployed using Docker, and integrations with other open source projects (Apache Jena, Tinkerpop) exist.

The canonical citation for LOD-a-lot is “Fernández, J. D., Beek, W., Martínez-Prieto, M. A., and Arias, M. LOD-a-lot: A Queryable Dump of the LOD Cloud (2017). http://purl.org/HDT/lod-a-lot”.

4 LOD-a-lot Statistics Summary

A simple analysis of LOD-a-lot reports some interesting statistics. Table 1 reports the number of unique triples and of distinct subjects, predicates, and objects in our dataset. The two rightmost columns also report the number of common subjects and objects, i.e. those terms playing both roles in the dataset, and the total number of literal objects. The results are in line with the widespread perception that the number of predicates is very limited w.r.t. the number of triples (in this case, about 1M distinct predicates for 28B triples, i.e. less than 0.004%) due to vocabulary reuse. A more detailed analysis (Fig. 2, middle) shows that predicates follow a power-law distribution, where a long tail of predicates is barely used while a limited set of predicates appears in a great number of triples.

Interestingly, almost the same number of subjects and objects (3B terms each) are used in LOD-a-lot. The high ratio w.r.t. the number of triples (11%) indicates low reuse of such terms. Figure 2 further elaborates on this and depicts the subject (left) and object (right) distributions. Both follow power laws, but objects exhibit a longer tail, with some objects repeated massively (up to 1B times). Finally, note two interesting numbers for understanding the underlying dataset structure: (i) around 40% of subjects and objects play both roles, which means that it is easy to find chain paths of at least two connected triples; and (ii) more than 1.3B objects are literals, so 41% of object nodes have no outgoing links.
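
The top-level counts in Table 1 can be read directly from the HDT Header and Dictionary components, without scanning the triples. A sketch with the pyHDT bindings follows; the file name is illustrative and the property names are assumed from the pyHDT documentation, so treat them as assumptions if your bindings differ:

  # Summary counts straight from the HDT Header/Dictionary (no triple scan).
  # File name is illustrative; property names are assumed from the pyHDT docs.
  from hdt import HDTDocument

  doc = HDTDocument("lod-a-lot.hdt")
  print("triples:    ", doc.total_triples)
  print("subjects:   ", doc.nb_subjects)
  print("predicates: ", doc.nb_predicates)
  print("objects:    ", doc.nb_objects)
  print("shared s/o: ", doc.nb_shared)  # terms used both as subject and object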

Table 1. LOD-a-lot summary statistics.
Fig. 2. Distribution of subjects, predicates, and objects in LOD-a-lot (log-log scale).

A space analysis shows that the LOD-a-lot HDT dump encodes 28B triples in 304 GB: 133 GB are used for the compressed Dictionary, and 171 GB for the Triples component. HDT-FoQ indexes are also built to speed up all TP queries over the queryable dump; these additional structures take 220 GB.

Finally, we performed a deployment test (using the HDT-C++ library) on a modest computer, resulting in a load time of only 144 s and a memory footprint of 15.7 GB of RAM (≈3% of the total dataset size). Furthermore, LDF queries (with a page size of 100 results) are resolved in milliseconds. This shows that LOD-a-lot makes managing and querying 28B triples affordable.
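
A comparable (if rougher) measurement can be reproduced with a few lines of Python over the pyHDT bindings; the file name is illustrative and the memory reading is Unix-specific:

  # Rough reproduction of the deployment test: measure load time, resident
  # memory and a warm TP lookup. File name is illustrative; Unix-only.
  import time, resource
  from hdt import HDTDocument

  start = time.time()
  doc = HDTDocument("lod-a-lot.hdt")  # loads the HDT file and its side index
  print("load time: %.1f s" % (time.time() - start))

  rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux
  print("resident memory: %.1f GB" % (rss_kb / 1024 ** 2))

  start = time.time()
  _, card = doc.search_triples("", "http://www.w3.org/2000/01/rdf-schema#label", "")
  print("TP cardinality %d estimated in %.3f ms" % (card, (time.time() - start) * 1e3))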

5 Relevance of the Dataset

This section describes three focused use cases for LOD-a-lot.

Query resolution at Web scale (UC1) is still an open challenge. Besides the aforementioned drawbacks of query federation [17, 18] and follow-your-nose traversal querying [12], pioneering centralized approaches, such as Sindice [19], have already been discontinued. OpenLink Software's LOD Cloud Cache maintains a SPARQL endpoint over a portion of the LOD Cloud, but it only reports 4B triples, the system suffers from the traditional size/time restrictions of SPARQL endpoints, and simple unbounded queries (e.g. the query in Sect. 1) incur timeouts. LOD-a-lot promotes query resolution at Web scale not only by actually providing such a service for the indexed 28B triples, but also by showing the feasibility, scalability and efficiency of a centralized approach based on HDT and LDF.
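
As an illustration, the introductory query (entities reaching a resource labelled “Tim Berners-Lee” via zero or more owl:sameAs links) can be resolved client-side against the LOD-a-lot HDT file by combining triple pattern lookups. The following is a hedged sketch with the pyHDT bindings; the file name is illustrative and literal matching is simplified to a single lexical form:

  # Client-side resolution of: select distinct ?x
  #   { ?x owl:sameAs*/rdfs:label "Tim Berners-Lee" }
  # Sketch only: file name is illustrative; language tags/datatypes ignored.
  from hdt import HDTDocument

  RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
  OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

  doc = HDTDocument("lod-a-lot.hdt")

  # Step 1: entities that directly carry the label (zero-length sameAs path).
  labelled, _ = doc.search_triples("", RDFS_LABEL, '"Tim Berners-Lee"')
  results = {s for s, _, _ in labelled}

  # Step 2: expand owl:sameAs backwards until a fixpoint is reached.
  frontier = set(results)
  while frontier:
      nxt = set()
      for y in frontier:
          same, _ = doc.search_triples("", OWL_SAMEAS, y)
          nxt.update(s for s, _, _ in same)
      frontier = nxt - results
      results |= frontier

  print(len(results), "matching entities")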

Evaluation and benchmarking (UC2) have increasingly gained attention in the Semantic Web community [4]. However, Semantic Web evaluations still lack volume and variety. The Billion Triple Challenge (BTC) [13] and the WebDataCommons Microdata, RDFa and Microformats dataset series [16] assist in this context by crawling RDF data from the Web and providing a single integrated dataset. However, BTC is limited to 4B triples and uses a minimal sample of each crawled data source, which provides an incomplete view of the data. In turn, the WebDataCommons dataset scales in size (44B triples), but its focus on Microdata means that its variety and general applicability are limited in practice. LOD Laundromat addresses this issue and republishes heterogeneous RDF datasets, but these have to be managed independently, which can be a pain point for consumers. LOD-a-lot thus integrates the main advantages of all these proposals in terms of size (28B triples), variety (650K datasets) and single access point. LOD-a-lot is extremely easy and efficient to deploy in a local environment (via HDT), which allows Semantic Web academics and practitioners to run experiments over the largest and most heterogeneous indexed, ready-to-consume RDF dataset.

RDF metrics and analytics (UC3) are widely used in SPARQL query optimization techniques [10] in order to find the optimal query plan. However, few studies inspect structural properties of real-world RDF data at Web scale [6], and even those only involve a few million triples. More recently, the potential of LOD Laundromat has been exploited to characterize the quality of the data [1]. The LOD-a-lot characteristics (see Sect. 2) democratize the computation of RDF metrics and analytics at Web scale (see the degree distributions in Fig. 2 as a practical example). Furthermore, particular metrics can take advantage of the HDT components in isolation, e.g. computing the average length of IRIs and literals only requires scanning the Dictionary (which collects all terms), whereas computing the in-degrees of objects only requires accessing the Triples component (which indexes the graph).
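
For instance, the in-degree of a given node can be read from the cardinality of an object-bounded triple pattern without materializing any results; a small sketch with the pyHDT bindings (file name and example IRI are illustrative):

  # Object in-degree from a single object-bounded TP lookup: the cardinality
  # is answered from the Triples indexes, no results need to be materialized.
  # File name and example IRI are illustrative.
  from hdt import HDTDocument

  doc = HDTDocument("lod-a-lot.hdt")

  def in_degree(node_iri):
      _, cardinality = doc.search_triples("", "", node_iri)
      return cardinality

  print(in_degree("http://xmlns.com/foaf/0.1/Person"))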

In addition, we envision further practical applications for entity linking and data enrichment (e.g. leveraging in-links and owl:sameAs-related entities), ranking of entities and vocabularies (e.g. analyzing their use), data summarization and other data mining techniques (e.g. finding commonalities in the data).

6 Conclusions and Future Work

The steady adoption of Linked Open Data (LOD) in recent years has led to a significant increase in the number and volume of RDF datasets. Today, problems such as data discovery and structured querying at web scale remain open challenges given the distributed nature of LOD.

This paper has presented LOD-a-lot, a simple and cost-effective way to query and study a large copy of the LOD Cloud. LOD-a-lot collects all data gathered by the LOD Laundromat service and exposes it as a single HDT file, which can be queried online for free, and which can be downloaded and queried locally over commodity hardware. Requiring 524 GB of disk space and 15.7 GB of RAM, LOD-a-lot allows more than 28 billion unique triples to be queried using hardware costing – at the time of writing – 305 euro.

We plan to update LOD-a-lot regularly and include further datasets from the LOD Cloud. We are also working on a novel HDT variation to index quad information and thus keep track of the input sources contributing to LOD-a-lot. Altogether, we expect LOD-a-lot to democratize access to LOD and to become a reference for low-cost Web-scale evaluations.